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1  Introduction 


A  key  challenge  in  organizational  research  is  accounting  for  the  existence  of  levels,  i.e.,  distinguishing  be¬ 
tween  the  effects  on  individual  and  organizational  behavior,  when  analyzing  a  particular  process.  Multilevel 
modeling  has  proven  to  be  an  effective  method  for  investigating  the  effects  at  several  levels  within  an  orga¬ 
nizational  hierarchy.  In  today’s  commericial  and  open-source  software,  several  tools  exist  that  can  estimate 
a  multilevel  model  for  a  data  set  of  continuous  measurements.  For  data  sets  with  binomial  or  categorical 
measurements,  multilevel  modeling  requires  estimating  a  generalized  linear  mixed  model  (GLMM).  Cur¬ 
rently,  extant  software  tools  for  estimating  a  GLMM  are  very  limited  in  that  the  numerical  methods  used 
can  be  slow,  inaccurate,  or  both.  In  this  work,  Toyon  Research  Corporation  and  the  University  of  Wisconsin 
are  developing  a  software  package  for  estimating  a  GLMM  in  R,  an  open-source  statistical  environment. 
For  Phase  I,  we  have  developed  an  approach  that  fits  an  approximate  model,  but  uses  methods  that  can  be 
extended  for  maximum  likelihood  (ML)  methods  that  we  will  pursue  in  Phase  II. 

2  Body 

As  discussed  in  our  Phase  I  proposal,  there  are  three  main  objectives  for  the  Phase  I  work: 

1 .  Develop  a  preliminary  version  of  the  code  for  estimating  a  GLMM,  which  can  be  used  as  a  foundation 
for  Phase  II. 

2.  Demonstrate  the  code  using  simulated  data  sets 

3.  Identify  a  suitable  approach  for  modeling  ordinal  data  sets 

4.  Develop  a  suitable  interface 

We  elected  to  use  the  standard  R  interface  for  a  function,  and  not  to  develop  any  specialized  or  graphical 
user  interface. 

As  discussed  in  the  proposal  for  Phase  I,  Penalized  Quasi  Likelihood  (PQL)  is  an  approximate  method 
that  can  estimate  a  GLMM  for  binary  data.  We  proposed  to  use  this  approximate  method  to  estimate  an 
initial  solution,  which  can  be  further  refined  using  a  more  accurate,  but  computationally  intensive  procedure. 
The  Phase  I  effort  focused  on  using  PQL,  with  extensions  to  other  methods  to  be  developed  in  Phase  II. 

Code  development  was  the  major  effort  in  Phase  I,  and  consisted  of  some  low-level  R  programming  and 
theoretical  work  on  the  numerical  methods  for  GLMM.  The  programming  tasks  consisted  of  re-implementing 
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the  Ime  function  in  R,  to  take  advantage  of  S4-classes  and  better  linear  algebra  solvers.  Because  the  lme 
function  is  called  repeatedly  in  several  iterations  for  estimating  a  GLMM,  it  is  the  “long  pole  in  the  tent”  in 
terms  of  speed  and  accuracy. 

The  PQL  method  directly  works  for  binary  data,  but  does  not  clearly  extend  for  categorical  data  (i.e., 
where  the  response  variable  can  be  more  than  two  values).  For  Phase  I,  we  focused  on  identifying  a  suitable 
method  for  ordinal  data,  which  are  a  particularly  kind  of  categorical  data  where  the  categories  are  ordered. 
The  main  intent  for  this  work  was  to  identify  a  method  analogous  to  the  PQL  for  binary,  i,e.,  a  method  that 
can  produce  an  initial  estimate  for  an  ML  procedure. 

Before  discussing  these  tasks  in  detail,  we  review  the  pertinent  technical  background  in  the  next  section 
and  describe  our  over-arching  vision. 

2.1  Technical  Background 

Mixed-effects  models  are  statistical  models  that  describe  the  behavior  of  a  response  variate  as  it  relates  to 
measured  covariates.  These  models  incorporate  both  fixed-effects  parameters,  which  are  parameters  that 
relate  to  the  entire  population  or  well-defined  subgroups  of  the  population,  and  random-effects,  which  are 
random  variables  associated  with  individual  experimental  units.  They  are  particularly  useful  with  longitu¬ 
dinal  data  —  data  that  are  collected  over  time  on  each  of  several  subjects.  As  described  in  the  program 
solicitation  the  subjects  can  be  grouped  according  to  one  or  more  nested  layers  of  grouping.  For  example, 
soldiers  can  be  grouped  in  squads  that  are  within  platoons,  that  are  within  companies,  and  so  on. 

A  multilevel  model  (Goldstein,  1995;  Snijders  and  Bosker,  1999;  Raudenbush  and  Bryk,  2002;  deLeeuw 
and  Kreft,  2002)  is  another  name  for  a  mixed-effects  statistical  model  that  includes  random-effects  terms  at 
one  or  more  nested  levels.  Multilevel  models  apply  to  longitudinal  data  or  to  general  repeated  measures  data 
that  are  not  necessarily  gathered  over  time. 

Linear  models  are  statistical  models  for  responses  that  are  measured  on  a  continuous  scale.  They  have  the 
property  that  the  prediction  of  the  response  is  a  linear  function  of  the  model  parameters,  or  coefficients,  and 
terms  derived  from  the  values  of  the  covariates.  In  contrast,  a  nonlinear  statistical  model  generally  refers  to 
models  for  responses  measured  on  a  continuous  scale  but  where  the  prediction  of  the  response  is  a  nonlinear 
function  of  the  model  parameters. 

Generalized  linear  models  (McCullagh  and  Nelder,  1989)  are  statistical  models  in  which  a  function, 
called  the  link  function,  of  the  expected  response  is  modeled  as  a  linear  combination  of  model  terms.  GLMs 
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are  frequently  used  to  model  a  dichotomous  response  and  some  variations  of  GLMs  are  used  for  ordinal 
data.  Typical  link  functions  for  dichotomous  responses  are  the  logit  link  or  the  probit  link.  There  is  a  very 
efficient  algorithm  called  iteratively  reweighted  least  squares  (IRLS)  that  is  used  to  fit  GLMs  that  do  not  have 
random-effects  terms. 

A  linear  model  with  both  fixed-effects  terms  and  random-effects  terms  is  a  linear  mixed-effects  model 
The  R  function  Ime  from  the  NLME  package  (Pinheiro  and  Bates,  2000)  calculates  maximum  likelihood 
(ML)  or  restricted  maximum  likelihood  (REML)  parameter  estimates  for  these  models.  A  nonlinear  model 
with  random-effects  terms  is  a  nonlinear  mixed-effects  model.  The  R  function  nlme,  also  in  the  NLME 
package,  calculates  parameter  estimates  for  these  models.  A  generalized  linear  model  with  random-effects 
terms  is  called  a  generalized  linear  mixed  model  (GLMM)  or  a  multilevel  generalized  linear  model  (Ro¬ 
driguez,  2002).  At  present  the  NLME  package  does  not  provide  capabilities  for  estimating  the  parameters  in 
GLMMs.  The  R  function  glmmPQL  in  the  MASS  package  provides  maximum  likelihood  estimates  of  the 
parameters  in  a  GLMM  using  an  approximation  method  called  penalized  quasi-likelihood.  Another  function, 
glmm  in  the  GLMMGibbs  package,  approximates  ML  estimates  through  a  Gibbs  sampler. 


2,2  Technical  approach 

For  this  STTR  project,  we  will  develop  and  implement  in  R  methods  to  determine  the  maximum  likelihood 
estimates  of  parameters  in  generalized  linear  mixed  models  (GLMMs)  for  longitudinal  data.  A  GLMM  with 
one  level  of  random  effects  is  similar  to  the  linear  mixed-effects  (LME)  model 


Vi  —  X$  +  Zibi  -f  €£,  bi  ~ e*  ^ Af(Q^a2I)^  i  =  1?... 

€i±€j,  bi  -L  j;  €i±bj ,  alH,j 


m, 


(1) 


where  yi  is  the  vector  of  length  n*  of  responses  for  subject  i;  is  the  th  x  p  model  matrix  for  subject  i 
with  respect  to  {3\  and  Zi  is  the  rii  x  q  model  matrix  for  subject  i  and  the  random  effects  52  .  The  symbol 
±  indicates  independence  of  random  variables.  The  columns  of  the  model  matrices  Xi  and  Zi  are  derived 
from  covariates  observed  for  subject  L  The  q  x  q  relative  dispersion  matrix  D  is  symmetric  and  must  be 
positive  definite.  Because  it  contains  redundant  entries  and  its  elements  are  constrained  we  do  not  express 
the  likelihood  in  terms  of  it.  Instead  we  use  0,  which  is  any  unconstrained,  non-redundant  set  of  parameters 
that  determine  D, 

The  maximum  likelihood  estimates  for  model  (1)  are  those  parameter  values,  /3,  a2  and  Q  that  maximize 
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the  likelihood  or,  equivalently,  maximize  the  log-likelihood,  of  the  statistical  model  given  the  observed 
data.  The  likelihood  of  the  parameters  given  the  data,  written  L{/3, 0,  a2\y),  is  the  same  expression  as  the 
probability  density  for  the  data  given  the  parameters,  written  p(y\/3, 0,  a2),  where  y  is  the  combined  vector 
of  responses  for  all  the  subjects.  To  obtain  p(y\f3, 0 ,  a2),  we  must  integrate  the  conditional  density  for  each 
subject,  p(yi\bi,  (3,  a2),  with  respect  to  the  distribution  of  the  random  effects  p(bi\0,  a2).  Symbolically 


m 

L((3, 0, o2\y)  =  0,  a2) 

i=l 

p{Vi\bi,  /3,  a2)p(bi\0,  a2)  dbt 


(2) 


In  a  GLMM  it  is  a  function  g(p),  called  the  link  function,  of  the  mean  response  that  is  expressed  as  the 
linear  predictor  Xij3  +  Zxbt.  Furthermore,  the  parameter  a2  is  replaced  by  a  scale  parameter  and,  in  the 
most  common  types  of  GLMMs,  the  scale  parameter  is  constant  and  is  not  estimated.  Thus  we  write  the 
model  as 

g(m)  =  Xi(3  +  Zibi  6i~Af(0,D),  ,  i  =  l,...,m, 

(3) 

Vi  -L  Vji  bi±bj,  i  f  j;  e,  ±  bj,  all  i,j 

where  g(p)  is  the  nj-dimension  vector  obtained  by  applying  g  componentwise  and  p  is  a  known  distribution 
such  as  binomial  or  Poisson.  The  likelihood  has  a  form  similar  to  (2) 

m 

HP,Q\y)  =  Y[p(vi\P,0) 

(4) 

p{yi\bi,(3)p{bi\0)dbi 

with  one  major  difference:  the  integral  in  (2)  has  a  closed-form  expression  but  the  integral  in  (4)  does  not. 
It  is  the  lack  of  a  closed-form  expression  for  this  integral  that  vastly  increases  the  complexity  of  parameter 
estimation  for  GLMMs  relative  to  LMEs  because  we  must  do  an  iterative  optimization  of  an  objective, 
L{0, 0),  that  is  an  integral  which  must  itself  be  approximated  by  numerically  intensive  techniques. 


2,2.1  Approximating  the  integral 

The  glmmPQL  approach  implemented  by  Brian  Ripley  for  the  MASS  package  has  been  reimplemented  in 
Phase  I  of  this  project.  It  uses  iteratively  reweighted  LME  approximations  to  the  GLMM  to  jointly  optimize 
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0  and  6.  Because  of  the  LME  approximation,  an  estimate  of  a 2  is  also  produced  but  not  used.  More  accurate 
approximations  to  the  likelihood  are  obtained  by  approximation  the  integral  in  (4).  As  described  in  the  phase 
I  proposal  we  will  implement  the  deterministic  approximations  called  second-order  Laplacian  and  adaptive 
Gauss-Hermite  quadrature  in  Phase  II. 


As  described  in  Thisted  (1988,  §5.3),  Gaussian  quadrature  is  used  to  approximate  one-dimensional  inte¬ 


grals  of  the  form 


f(x)w(x)  dx 


(5) 


where  w(x)  is  a  weight  function.  For  a  fixed  weight  function  and  interval,  Gaussian  quadrature  rules  can  be 
generated  for  any  order  polynomial.  Gauss-Hermite  quadrature  applies  to  integrals  of  the  form 


f°°  2 

I  =  f{x)e  x  dx  (6) 

J — OO 

which  is  approximately  the  form  that  we  expect  the  integral  / p(yi\bu  0)p(bi\d)  dbt  to  take. 

In  adaptive  Gauss-Hermite  quadrature  we  center  the  integrand  p(yi\bi,  0)  p(bi\6)  about  its  conditional 
mode  bi(0, 6)  before  applying  the  Gauss-Hermite  rule.  This  technique  requires  substantially  fewer  function 
evaluations  than  does  Gauss-Hermite  quadrature  centered  at  6j  =  0  to  achieve  the  same  level  of  accuracy 
in  the  approximation.  When  adaptive  Gauss-Hermite  quadrature  is  reduced  a  single  function  evaluation 
it  is  equivalent  to  the  second-order  Laplacian  quadrature  rule.  Second-order  and  higher-order  Laplacian 
quadrature  rules  are  described  in  Raudenbush  and  Bryk  (2002). 


2.2.2  Determining  bi(0,  Q) 

The  same  iteratively  reweighted  penalized  least  squares  technique  used  in  glmmPQL  to  simultaneously  op¬ 
timize  0  and  the  bi}  i  =  1, . . . ,  m  can  be  used  for  the  conditional  optimization 


bi(0, 6)  =  arg  max p(y i\bi,  0) p(&i|0)  (7) 

Vi 

This  greatly  simplifies  the  code  for  Laplacian  and  adaptive  Gauss-Hermite  quadrature. 

To  demonstrate  this,  follow  the  derivation  in  McCullagh  and  Nelder  (1989,  §2.5)  including  the  penalty 
term. 

In  providing  methods  for  GLMMs  we  will  restrict  our  attention  to  local  linearization  and  to  adaptive 
Gauss-Hermite  quadrature.  The  local  linearization  technique  is  called  penalized  quasi-likelihood  and  has 
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already  been  implemented  by  Prof.  Brian  Ripley  for  the  MASS  package  (Venables  and  Ripley,  1999)  for  R. 
The  local  linear  approximation  to  the  GLMM  model  for  PQL  is  analogous  to  that  used  in  the  IRLS  algorithm 
for  GLMs.  It  replaces  die  likelihood  for  the  generalized  model  by  a  quasi-likelihood  for  which  a  least  squares 
estimate  can  be  calculated.  The  quasi-likelihood  is  based  on  working  residuals  and  weights  calculated  at 
the  current  parameter  estimates.  When  the  estimates  are  updated,  the  weights  and  working  residuals  are 
recalculated,  producing  a  new  weighted  least  squares  problem.  As  described  in  Pinheiro  and  Bates  (2000, 
chapter  2)  the  ML  or  REML  criterion  in  a  linear  mixed-effects  model  can  be  efficiently  evaluated  from  a 
penalized  least  squares  problem.  The  penalized  least  squares  calculation  can  be  applied  to  GLMMs  if  the 
likelihood  is  replaced  by  the  quasi-likelihood,  hence  the  name  penalized  quasi-likelihood. 

PQL  can  be  regarded  as  a  series  of  iteratively  reweighted  linear  mixed-effects  problems  where  the  opti¬ 
mization  of  the  linear  least  squares  problem  uses  penalized  least  squares.  In  fact,  this  is  exactly  Prof  Ripley’s 
implementation.  The  core  of  the  glmmPQL  function  is  a  loop  that  calls  lme  then  checks  for  convergence.  If 
convergence  has  not  been  achieved,  new  weights  and  working  residuals  are  calculated  followed  by  another 
call  to  lme.  In  our  experience  glmmPQL  converges  quickly,  usually  after  4  to  8  calls  to  lme.  The  indi¬ 
vidual  calls  to  lme  are  relatively  fast  but  could  be  made  much  faster.  Fitting  the  100  simulated  data  sets 
from  Rodriguez  and  Goldman  (1995)  took  about  45  minutes  in  R  on  a  1.2  GHz  desktop  computer  running 
Linux.  Each  of  these  100  model  fits  is  a  GLMM  fit  to  2445  observations  with  two  levels  of  random  effects. 
This  is  already  quite  reasonable  performance  for  an  interactive  language  like  R  but  we  have  been  able  to 
reimplement  lme  so  as  to  provide  a  set  of  interfaces  to  the  penalized  least  squares  optimization  at  the  core  of 
lme  that  avoid  unused,  repetitive  work  in  each  of  these  calls  to  lme  and  achieve  greater  stability  and  speed 
in  the  algorithm.  The  stability  is  achieved  because  we  can  use  the  converged  estimates  from  one  iteration  as 
the  starting  estimates  in  the  next  iteration.  We  have  accomplished  this  reimplementation  in  Phase  I  of  this 
project.  We  have  communicated  with  Prof.  Ripley  and  Dr.  Venables  regarding  this  project  and  they  are  quite 
willing  to  continue  to  cooperate  with  us.  We  will  ensure  that  the  R  implementation  of  glmmPQL  uses  the 
direct  interfaces  in  a  reimplemented  lme  whether  glmmPQL  continues  to  be  part  of  the  MASS  package  or 
becomes  part  of  the  NLME  package. 

Especially  with  the  changes  that  we  propose  for  lme,  glmmPQL  is  a  fast  and  reliable  method  for  esti¬ 
mating  parameters  in  a  GLMM.  However,  unlike  GLMs  where  the  IRLS  estimates  are  indeed  the  maximum 
likelihood  estimates,  in  GLMMs  the  PQL  estimates  are  not  necessarily  the  maximum  likelihood  estimates. 
For  example,  the  Rodriguez  and  Goldman  (1995)  data  sets  were  simulated  using  fixed-effects  parameters 
P  =  [0.5, 1.,  1.,  1.]'  and  with  variance  components  of  o\  =  1.0  at  the  family  level  and  a2  =  1.0  at  the  com- 
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mumty  level  (A  GLMM  for  dichotomous  data  is  based  on  a  binomial  distribution  and  does  not  have  a  scale 
parameter  like  a2.)  The  average  parameter  estimates  over  the  100  simulated  data  sets  fit  with  gltnmPQL 
were  $  =  [0.6322,0,9759,0.9427,0,9880]',  #1  =  1.881,  and  d2  =  1.2206.  Compared  with  the  estimates 
from  other  methods,  as  given  in  Rodriguez  (2002,  Table  10.1),  the  estimates  of  the  variance  components  are 
quite  reasonable  but  the  estimates  of  the  fixed-effects  parameters,  like  those  from  other  PQL-based  methods, 
are  biased  downward. 

Systematically  underestimating  the  magnitude  of  the  fixed-effects  estimates  is  a  well-known  property  of 
the  PQL  method.  Producing  better  estimates  involves  using  a  better  approximation  to  the  likelihood  —  in 
this  case  either  a  Laplacian  approximation  or  adaptive  Gauss-Hermite  quadrature.  Both  of  these  techniques 
approximate  integrals  of  functions  that  are  close  to  e~x2  in  form,  Laplacian  integration  (Tierney  and  Kadane, 
1986)  uses  the  value  and  the  Hessian  of  the  log-integrand  at  its  conditional  mode.  Adaptive  Gauss-Hermite 
quadrature  uses  the  value  of  the  log-integrand  at  the  conditional  mode  and  at  some  number  of  nearby  points 
whose  location  is  determined  from  the  Hessian  of  the  log-integrand  according  to  the  Gauss-Hermite  formula. 
Because  the  locations  of  the  log-integrand  evaluations  are  symmetric  about  the  conditional  mode  the  total 
number  of  points  in  each  axis  direction  is  an  odd  number.  Generally  five  or  seven  points  will  produce 
sufficient  accuracy  in  AGQ  evaluation  for  GLMMs.  A  one-point  AGQ  evaluation  is  the  same  as  the  Laplacian 
approximation. 

Notice  that  the  Laplacian  and  AGQ  methods  will  require  evaluation  of  the  conditional  modes  of  the  ran¬ 
dom  effects  for  each  subject  every  time  the  likelihood  is  evaluated  during  the  iterative  optimization  method. 
Obviously,  the  determination  of  the  conditional  modes  will  need  to  be  made  as  efficient  as  possible.  In  our 
redesign  and  reimplementation  of  the  lme  code  we  will  make  provision  for  using  IRLS  within  compiled 
code  for  determining  the  conditional  modes  of  the  random  effects. 

More  accurate  approximations  that  use  Laplacian  approximations  or  adaptive  Gauss-Hermite  quadrature 
are  much  more  compute-intensive  than  PQL.  One  approach  to  parameter  estimation  is  to  use  the  expensive 
approximations  to  the  log-likelihood  throughout  the  entire  iterative  process  of  determining  parameter  es¬ 
timates.  PROC  NLMIXED,  available  in  SAS  (http://www.sas.com)  versions  7,0  and  later,  employs  this 
approach  using  adaptive  Gauss-Hermite  quadrature  throughout.  It  is  an  intensive  approach  and  to  make  it 
feasible  the  SAS  procedure  uses  highly  tuned,  inflexible  code.  As  an  example  of  the  inflexibility,  PROC 
NLMIXED  cannot  be  used  on  the  Rodriguez  and  Goldman  simulated  data  sets  because  it  does  not  allow  for 
nested  random  effects. 

During  an  iterative  algorithm  to  optimize  the  log-likelihood  it  is  not  necessary  to  use  an  expensive  ap- 
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proximation  to  the  log-likelihood  except  when  close  to  the  optimum.  Thus  we  recommend  using  simpler 
approximations  initially  and,  when  the  simple  approximation  has  converged,  switch  to  a  more  expensive 
approximation  for  final  tuning.  Because  the  Laplacian  and  AGQ  evaluations  are  much  more  computationally 
intensive  that  the  PQL  iterations,  it  does  not  make  sense  to  start  with  Laplace  or  AGQ,  We  propose  using 
PQL  initially  until  it  has  converged,  which,  as  mentioned  above,  occurs  quite  quickly,  then  switch  to  Laplace 
or  AGQ  to  finish  the  estimation. 

In  Phase  I  we  concentrated  on  enhancing  the  already  existing  PQL  implementation  at  the  level  of  the  pe¬ 
nalized  least  squares  problem.  The  improved  performance  and  interface  to  the  penalized  least  squares  prob¬ 
lem  will  be  useful  for  the  Laplacian  approximation  and  the  adaptive  Gauss-Hermite  quadrature  approaches 
in  later  phases. 

2.3  Re-implementation  of  the  1  me  function 

Our  specific  tasks  for  Phase  I  included  updating  the  linear  algebra  calls  and  interfaces  in  the  R  functions  lme, 
which  fits  linear  mixed-effects  models,  and  glmmPQL,  which  fits  generalized  linear  mixed  models  (GLMMs) 
by  penalized  quasi-likelihood  (PQL).  This  has  been  accomplished  in  the  Ime4  package  for  R-l.6.2  (released 
Jan.  10,2003), 

The  classes  and  methods  in  Ime4  use  the  formal  classes  and  methods  described  in  Chambers  (1998) 
and  implemented  in  the  methods  package  for  R.  The  underlying  numerical  methods  use  Lapack  (Anderson 
et  al.,  1999)  and  levels  1, 2,  and  3  of  the  BLAS  (Basic  Linear  Algebra  Subroutines).  ATLAS  (Automatically 
Tuned  Linear  Algebra  Software)  implementations  of  Lapack  and  the  BLAS  are  used  by  R  when  available. 
All  calls  to  the  underlying  C  code  use  the  .Call  interface  (R  Development  Core  Team,  2002,  §4,7)  that  enables 
us  to  control  the  number  of  copies  of  objects  that  are  created  during  the  calculations.  This  can  be  important 
when  working  with  large  data  sets  and/or  with  many  terms  in  a  statistical  model. 

These  changes  to  lme  and  glmmPQL  have  produced  a  dramatic  improvement  in  performance.  In  §2,5 
below  we  describe  the  numerical  results  of  glmmPQL  applied  to  100  sets  of  simulated  multilevel  binomial 
response  data.  When  run  on  a  2,0  GHz  Pentium  4  Linux  system  the  previous  version  of  glmmPQL  from  the 
MASS  package  took  an  average  of  30.6  seconds  of  user  time  per  data  set  to  converge.  The  version  in  ime4 
took  an  average  of  6,1  seconds  of  user  time  per  data  set. 

We  have  also  tested  the  ability  of  the  new  versions  to  work  with  large  data  sets  by  fitting  a  multilevel 
linear  model  with  a  fixed-effects  parameter  vector  of  length  47  to  378,047  test  scores  on  134,713  students  in 
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3,722  different  schools.  Each  copy  of  the  model  matrix  for  the  fixed  effects  in  this  model  is  approximately 
150  MB  in  size.  The  new  version  of  lme  was  able  to  fit  this  model  (on  the  same  2.0  GHz  Pentium  4  system 
running  Linux)  in  under  5  minutes  while  keeping  the  total  memory  image  to  less  than  1  GB. 


2.4  Analytical  results  for  optimization 

Another  specific  task  for  Phase  I  is  to  incorporate  the  analytic  gradient  calculations  developed  by  Saikat 
DebRoy  into  the  lme  optimization.  Doing  so  provides  faster  and  more  reliable  convergence  in  the  lme 
function,  which  is  used  both  for  model  fitting  by  itself  and  in  the  iteratively  reweighted  penalized  least 
squares  (IRPLS)  algorithm  used  as  the  inner  loop  in  glmmPQL  fits  of  GLMMs.  In  Phase  II IRPLS  will  be 
used  to  determine  the  conditional  modes  of  the  random  effects  for  the  Laplacian  and  the  Adaptive  Gauss- 
Hermite  Quadrature  fits  of  GLMMs, 

We  give  here  an  overview  of  the  calculations  for  ML  estimation  in  a  linear  mixed-effects  model  single 
level  of  random  effects  (1), 

Calculations  are  based  on  a  precision  factor  A,  which  is  any  matrix  that  satisfies  A' A  =  D1 .  Given 
the  current  Q  we  form  a  series  of  orthogonal-triangular  decompositions 


followed  by 


floo(l)  C0(1)  -Roo  Co 

:  :  =Qo  0  e_i  , 

R00(m)  CCKro)]  [  0  0 


(9) 


These  decompositions  provide  the  profiled  log-likelihood  1(d),  the  conditional  estimates  0(0)  of  the  fixed 
effects,  the  conditional  estimate  cr2(0)  of  the  variance,  and  the  conditional  modes  bfd),  i  =  1. ....  to  of  the 
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random  effects* 


1(0)  =  k  -  nlogabsc-r  +  Y^logabs  ■ 

S'  \Rim\ 

(10) 

0(e)  =  Rqq  co 

(11) 

bi(0)  =  (c^)  -  Rio(i)f}(efj  i  =  1, . . . , m 

(12) 

c2(0)  =  <?-i/n 

(13) 

With  one  more  orthogonal-triangular  decomposition 


bx/a  ...  bm/a  —VA 


we  can  evaluate  the  gradient  of  the  profiled  log-likelihood  as 


d[ 

de 


—  m#fJ) 

i= 1  j= 1 


d*i0) 

de 


(14) 


(15) 


where  (  A'A)^  and  are  the  (i,  j)-th  elements  of  A1  A  and  =  D,  respectively. 

Optimization  of  the  profiled  log-likelihood  with  respect  to  6,  even  with  the  analytic  gradient  (15)  avail¬ 
able,  can  be  a  difficult  optimization  problem  from  poorly-chosen  initial  values.  The  EM  algorithm  and  the 
ECME  (expectation  conditional  maximization  either)  variant  of  the  EM  can  be  used  to  refine  initial  estimates. 
DebRoy  and  Bates  (2003b)  derive  the  ECME  update  for  $  as 


=  \/m  (A-1)' 


The  matrix  A  produced  by  decomposition  (14)  contains  all  the  information  needed  to  calculate  both  the 
ECME  algorithm  and  the  gradient  of  the  profiled  log-likelihood,  and  we  use  this  to  produce  compact  and 
efficient  code  for  the  ime4  package.  A  single  C  function  called  cotntnonDecotnpose  is  called  before 
an  ECME  iteration  and  a  gradient  calculation.  This  function  calculates  and  installs  the  matrix  A  as  the 
updateFactor  slot  of  each  ImeLevel  object  and  all  further  calculations  can  be  expressed  in  terms  of 
that  matrix. 

The  analytic  gradients  and  the  ECME  iterations  are  generalized  to  multiple  nested  levels  of  random  ef¬ 
fects  for  both  maximum  likelihood  (ML)  estimation  and  restricted  maximum  likelihood  (REML)  estimation 
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Effects 

True  value 

Mean 

Median 

Fixed  effects 

Individual 

1.0 

0.9584 

0.9812 

Family 

1.0 

0.9296 

0.9261 

Community 

1.0 

0.9715 

0.9589 

Random  effects  (o) 

Family 

1.0 

1.6760 

1.8494 

Community 

1.0 

1.1794 

1.1678 

Table  1 :  Mean  and  median  coefficients  for  covariates  and  estimated  standard  deviations  of  random  effects  in 
the  simulated  data  sets  from  Rodriguez  and  Goldman  (2001) 

in  DebRoy  and  Bates  (2003a),  All  of  these  results  are  incorporated  into  the  methods  in  the  Ime4  package 
for  R  using  Lapack  and,  when  available,  using  ATLAS  for  maximum  efficiency. 


2.5  Demonstration  using  simulated  data 

We  have  tested  the  new  implementions  of  the  lme  and  glmmPQL  functions  on  several  data  sets  including 
the  simulated  data  described  in  Rodriguez  and  Goldman  (1995)  and  Rodriguez  and  Goldman  (2001).  These 
data  are  100  simulated  sets  of  2445  binary  responses  grouped  into  1558  families  in  161  communities.  The 
grouping  structure  and  the  values  of  covariates  at  each  of  the  levels  of  community,  family,  and  birth  are  based 
on  the  observed  structure  of  the  data  from  a  survey  of  prenatal  care  in  Guatemala, 

The  glmmPQL  function  from  the  lme4  package  converged  on  all  100  of  the  simulated  data  sets.  Sum¬ 
mary  results  are  shown  in  Table  1  and  boxplots  of  the  parameter  estimates  are  provided  in  Figure  1 .  The 
summary  results  can  be  compared  to  those  for  other  methods  as  given  in  Table  1  of  Rodriguez  and  Goldman 
(2001). 

In  evaluating  the  estimates  from  the  simulated  data  recall  that  they  will  be  used  as  starting  estimates  for 
the  more  accurate  adaptive  Gauss-Hermite  quadrature  method  in  the  final  version  of  the  GLMM  estimation. 
The  objective  of  the  PQL  estimation,  which  is  fast  and  simple,  is  merely  to  get  into  the  neighbourhood  of  the 
optimum,  which  these  certainly  do. 

Even  these  fast,  simple  estimates  are  superior  to  those  from  many  of  the  methods  examined  in  Rodriguez 
and  Goldman  (2001).  In  the  simulation  the  parameters  for  which  the  estimates  are  shown  in  Figure  1  were 
set  at  1  then  the  random  noise  was  added.  Estimates  for  the  fixed  effects  are  both  accurate  (small  bias)  and 
precise  (small  variation).  The  estimates  of  o\  and  a%  are  biased  upward  but  they  are  still  suitable  as  starting 
estimates  for  AGQ.  Although  some  of  the  estimates  of  <72  are  zero  this  is  not  alarming.  It  is  quite  possible 
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Figure  1 :  Boxplots  of  the  parameter  estimates  from  glmmPQL  for  the  1 00  simulated  data  sets  from  Rodriguez 
and  Goldman  (2001).  The  parameter  a\  is  the  standard  deviation  of  the  community-level  random  effect,  <t2 
is  the  standard  deviation  of  the  family-level  random  effect,  j3\  is  the  coefficient  of  the  community-level 
covariate,  is  the  coefficient  of  the  family-level  covariate  and  is  the  coefficient  of  the  individual-level 
covariate.  The  notches  in  the  boxes  represent  95%  confidence  intervals  on  the  median  of  the  parameter 
estimates, 

for  the  maximum  likelihood  estimate  of  a  variance  component  from  simulated  data  to  be  zero,  even  when  the 
data  were  simulated  with  a  non-zero  value.  Nevertheless,  we  will  take  this  into  account  when  using  the  PQL 
estimates  as  starting  estimates  for  AGQ  and  adjust  any  zero  estimates  from  PQL  before  beginning  AGQ. 

Similar  results  are  obtained  using  the  original  glmmPQL  function  from  the  MASS  package,  which  calls 
lme  from  the  nlme  package.  The  coefficient  estimates  are  not  necessarily  identical  on  each  of  the  simulated 
data  sets  because  Ripley’s  glmmPQL  restarts  each  call  to  lme  by  calculating  initial  values  whereas  the  Ime4 
version  of  lme  uses  the  most  recent  parameter  values.  This  can  affect  convergence  with  the  Ime4  version 
converging  in  fewer  iterations  on  most  occasions.  As  mentioned  above,  the  Ime4  version  of  glmmPQL  is  5 
times  as  fast  as  the  previous  version  on  these  tests. 

2*6  Modeling  ordinal  data 

As  stated  in  the  project  solicitation,  the  observed  data  from  investigations  and  studies  that  could  be  modeled 
using  this  R  package  are  often  ordinal  in  nature.  Moreover,  the  extant  software  tools  are  quite  limited  for 
modeling  ordinal  data.  Thus,  one  of  the  major  goals  for  the  R  package  is  to  develop  the  software  that  can 
build  an  accurate  GLMM  for  ordinal  data. 

As  discussed  previously,  in  order  to  expeditiously  build  an  accurate  GLMM  for  binary  data  one  must 
proceed  rather  cautiously;  i.e.,  estimating  a  GLMM  using  maximum  likelihood  methods  in  a  straightfor¬ 
ward  fashion  can  result  in  a  large  computational  effort  that  is  unlikely  to  produce  good  parameter  estimates. 
Thus,  we  propose  using  a  Penalized  Quasi-Likelihood  (PQL)  method  to  develop  an  approximate  solution  be- 
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fore  proceeding  to  more  rigorous  computational  methods  such  as  Laplaeian  integration  or  Adaptive  Gauss- 
Hermite  Quadrature  (AGQ).  Recall  that  the  PQL  produces  initial  estimates  for  these  rigorous  methods,  and 
also  may  be  used  to  quickly  investigate  several  model  structures.  This  initial  investigation  provides  a  “first 
cut”  in  the  model  development  process,  and  poor  structures  can  be  eliminated  from  further  consideration 
early  on  in  the  process. 

Unfortunately,  for  ordinal  data  there  is  currently  no  analogous  procedure  to  the  PQL  method.  In  order 
to  model  ordinal  data  in  the  R  package  a  different  approach  is  needed.  Thus,  the  main  goal  for  this  research 
is  to  develop  a  suitable  approach  for  that  could  be  used  for  ordinal  data  in  the  same  fashion  that  the  PQL 
method  is  used  for  binary  data. 

2,6,1  Ordinal  data  and  the  Proportional-Odds  Model 

The  primary  type  of  data  for  this  toolbox  are  composed  of  categorical  variables,  and  often  these  categories 
have  a  natural  ordering.  Such  categorical  variables  with  ordered  scales  are  defined  as  ordinal  data  (Agresti, 
1996).  Data  sets  consisting  of  ordinal  variables  are  quite  commonly  found  in  organizational  and  sociological 
studies,  for  example,  performance  on  a  test  (poor,  fair,  or  good),  attitude  toward  a  proposed  policy  change 
(disapprove,  neutral,  approve,  approve  strongly),  and  so  on. 

Categorical  variables  without  natural  ordering  are  defined  as  as  nominal  or  multinomial  variables.  Exam¬ 
ples  of  multinomial  data  include  religious  affiliation  or  the  kind  of  car  a  person  purchases.  Multinomial  data 
are  also  frequently  found  in  the  social  sciences,  and  a  researcher  could  use  this  package  to  develop  mixed- 
effect  models  for  multinomial  data  sets.  However,  the  inherent  ordering  in  ordinal  data  greatly  simplifies  the 
modeling  process,  and  was  therefore  the  initial  concern  for  Phase  I  of  this  research  project. 

As  discussed  previously,  a  GLMM  model  extends  a  GLM  to  include  random  effects.  Because  a  GLM  for 
ordinal  variables  uses  the  same  fundamental  approach  as  a  GLM  for  binary  variables,  we  briefly  summarize 
the  approach  for  binary  data  here  (MeCullagh  and  Nelder,  1989).  Let  y  denote  a  binary  random  variable, 
and  the  probability  of  success  (indicated  by  a  1)  is  denoted  by  it.  In  the  GLM,  a  model  for  it  is  constructed 
using  a  continuous  variable,  77,  which  is  linearly  related  to  one  or  more  predictors;  this  continuous  variable 
is  related  to  the  categorical  variable  using  a  link  function,  $(•), 

»?(*)  =  Y^&Xi  (16) 

i 

s(tt)  =  T)(x)  (17) 
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where  Xi  are  the  terms  used  to  predict  n.  Although  several  link  functions  may  be  used,  the  logit  link  function 
is  of  particular  interest  for  ordinal  variables: 


g(w)  =  log 


(18) 


For  an  ordinal  random  variable,  suppose  there  are  N  categories  of  interest.  Again  let  y  denote  the 
random  variable,  which  may  now  take  on  N  values  corresponding  to  each  category;  we  label  the  values  for 
the  N  categories  0, . . . ,  N  -  1.  Furthermore,  let  nj  denote  the  probability  of  being  in  category  j,  and  let  7, 
denote  the  cumlative  probability  for  category  j;  i.e.,  the  probability  of  being  in  category  0  to  category  j.  The 
proportional-odds  model  for  ordinal  variables  is  given  by, 


log 


(19) 


Compared  to  (18)  for  binary  variables,  an  additional  parameter  is  introduced,  0j,  which  is  the  threshold  for 
the  jth  category.  The  unique  feature  of  a  proportional-odds  model,  and  the  origin  of  its  name,  is  that  the  ratio 
of  the  odds  for  two  values  of  x  is  given  by, 


TjpEl)/^  ~  TjpEl))  =  -f)T(Xl-X2) 
7j(*2)/(1  -  7i(®2» 


(20) 


which  is  independent  of  the  category  (j).  The  proportional-odds  model  is  illustrated  in  Fig,  2,  for  a  single 
predictor  with  /3  >  0  and  three  possible  categories  for  the  response  variable.  Note  that  the  probability  for 
a  category  is  proportional  to  the  area  under  the  curve  within  the  region  for  that  category.  Note  that  as  x 
increases,  tts  increases  for  (3  >  0;  if  0  <  0,  the  response  probabilities  would  shift  in  the  opposite  direction, 
and  7T\  would  increase. 

As  discussed  by  Hedeker  (2002),  random  effects  may  be  included  in  the  proportional-odds  model  in  the 
same  way  that  random  effects  are  added  to  a  GLM  for  binary  data  and  the  logistic  regression  model.  That  is, 
the  continuous  variable  rj  is  predicted  with  new  terms  for  the  random  effects. 


1 j(®,  z)  =  PTX  +  bTz  (21) 

where  the  random  effects  are  assumed  to  be  normally  distributed  with  mean  zero  and  variance  o'},.  Using 
the  logit  link  function  in  (18)  and  the  thresholds  for  the  proportional-odds  model  in  (19),  the  following 
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Figure  2:  The  response  probabilities  for  the  proportional-odds  model,  j3>  0 
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parameters  must  be  estimated  for  the  mixed-effects,  proportional-odds  model: 

/ 3  Fixed-effects  parameters 
b  Conditional  modes  for  the  random  effects 
a\  Variance  of  random  effects 
8  Thresholds  for  the  proportional-odds  model 

Thus,  the  same  parameters  for  binary  data  and  the  logistic  regression  model  are  estimated,  and  the  thresholds, 
8 ,  are  now  estimated  for  the  proportional-odds  model 

In  principle,  these  parameters  could  be  estimated  directly  using  MLE.  However,  as  in  the  case  for  binary 
data,  the  resulting  optimization  would  be  very  difficult  computationally,  and  unlikely  to  produce  good  param¬ 
eter  estimates.  Furthermore,  the  addition  of  the  thresholds  precludes  the  PQL  approach  from  being  directly 
used  on  the  ordinal  problem.  There  is  no  clear-cut  numerical  method  to  use  as  an  approximate  procedure 
for  estimating  an  initial  estimate  for  the  MLE  and  for  exploring  different  model  structures  during  the  initial 
phases  of  the  model  development  process.  Thus,  we  have  investigated  two  approaches  that  are  somewhat 
heuristic  but  should  be  able  to  provide  an  approximate  solution. 

2,6.2  Continuous  Linear  Mixed-Effects  Modeling 

Perhaps  the  simplest  approach  for  modeling  ordinal  data  would  be  to  ignore  the  categorical  nature  of  the  data 
and  model  the  data  as  if  it  were  continuous.  That  is,  if  the  categorical  random  variable  y  falls  into  one  of  N 
categories,  the  values  for  y  are  labelled  as  y  £  0, . . , ,  N  —  1,  and  these  values  are  modeled  as  if  they  were 
on  the  continuous  scale  and  did  not  represent  N  distinct  categories.  Thus,  the  link  function  in  (17)  becomes 
the  identity,  and  a  linear,  mixed-effects  model  is  used  to  describe  the  date, 

Vc  =  pjx  +  bjz  +  e  (22) 

where  the  subscript  “c”  is  added  to  denote  the  continuous-model  predictions. 

By  ignoring  the  ordinal  nature  of  the  date,  the  linear  mixed-effects  model  in  (22)  may  be  fit  to  the  date 
using  a  single  call  to  the  Ime  function.  Building  models  using  this  approach  is  very  fast,  even  for  large  data 
sets.  Furthermore,  as  the  number  of  categories  increases  in  y,  the  data  begin  to  appear  “more  continuous”, 
and  a  linear  mixed-effects  model  can  be  quite  accurate. 
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Figure  3:  Predictions  using  the  continuous-LME  approach. 

However,  interpreting  the  model  parameters  and  the  model  predictions  in  terms  of  the  true  categories 
becomes  difficult.  As  shown  in  Fig.  3,  the  predictions  of  a  linear,  mixed-effects  model  can  be  quite  correlated 
with  the  categorical  values.  But  there  is  no  clear  meaning  for  model  predictions  that  lie  between  categories, 
or  are  beyond  the  range  of  category  values  (e.g.,  yc  <  0,  or  yc  >  3).  Furthermore,  the  model  parameters 
on  a  continuous  scale  are  not  related  to  the  proportional-odds  model  parameters  on  the  categorical  scale. 
Recall  that  one  of  the  main  reasons  for  approximating  the  data  with  a  simple  model  is  to  provide  good  initial 
estimates  to  a  more  rigorous  method  for  estimating  GLMM  parameters. 

2.6.3  Dichotomization 

An  alternative  approach  is  to  dichotomize  the  ordinal  data,  which  transforms  the  ordinal  problem  into  a 
binary  problem,  for  which  the  PQL  method  may  be  used  to  estimate  model  parameters.  Furthermore,  by 
dichotomizing  the  ordinal  data  in  the  way  that  is  outlined  below,  the  model  structure  for  the  dichotomized 
data  set  directly  corresponds  to  the  structure  of  the  proportional-odds  model;  i.e,,  the  estimated  parameters 
from  this  step  may  be  used  as  the  initial  estimate  for  a  ML  method. 

Once  again,  suppose  that  the  ordinal  random  variable  y  can  be  in  one  of  N  categories,  and  these  eate- 
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gories  are  labeled  such  that  y  €  0, . . . ,  N  —  1.  The  ordinal  data  set  consists  of  n  observations  of  y,  and  is 
denoted  as  the  vector  Y .  These  data  may  be  dichotomized  as  follows: 

1 .  For  each  category,  calculate  yn,j,  a  binary  flag  that  indicates  whether  y  is  in  categories  0  through  j: 


VD,j  = 


0  y<j 
1  V>j 


Note  that  by  this  definition,  yofi  =  1. 

2.  Concatenate  the  results  for  Yn,j  to  create  Yd: 


YD  = 


Yd,  i 
Yd,n-i 


(23) 


(24) 


Note  that  Yd  will  be  length  n  x  (N  -  1). 

3,  Define  an  indicator  variable,  in,  which  corresponds  to  an  (N-l)-level  factor  that  indicates  for  which 
category  the  current  observation  ofy/>  is  calculated: 


iD,j  =  j  +  1  (j  =  1,  ■  ■  • ,  N  -  1) 


(25) 


4,  Re-define  the  model  matrix  for  the  fixed-effects  to  include  i 

=  [fiT  Pi] 

X  =  [x  Xq) 

where  xD  is  the  appropriate  model  matrix  for  an  (N-l)-level  factor  (Pinheiro  and  Bates,  2000),  corre¬ 
sponding  to  is', 

5,  Estimate  the  mixed-effects  model  for  the  binary  data,  where  the  formula  for  7}  is  given  by: 

n  —  0Tx  +  bZ  (26) 


In  terms  of  the  proportional-odds  model  for  the  ordinal  data,  the  estimates  for  /?,  6,  and  can  be  used 
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directly.  Furthermore,  the  model  estimates  for  /?#,  the  intercepts  for  the  N-level  factor  %o,  can  be  used  for 
the  thresholds  0,  in  the  proportional-odds  models. 

Note  that  the  model  predictions  can  be  the  inverted  back  to  the  ordinal  scale.  Given  values  for  x  and 
the  model  can  predict  rj  on  the  linear  scale;  denote  this  prediction  as  rj.  Using  the  inverse  of  the  link 
function  (McCullagh  and  Nelder,  1989),  rj  can  be  converted  into  7^,  predicted  cumulative  probabilities.  The 
cumulative  probabilities  can  be  transformed  into  the  categorical  probabilities  by: 

%  =  %  ~  7j-i 

The  predicted  category,  y,  is  die  category  with  the  largest  value  of  if, 

2,6.4  Proposed  Approach 

The  continuous  approach  is  very  fast  and  might  model  the  ordinal  data  fairly  well  (i.e.,  the  model  predictions 
can  be  quite  correlated  with  the  measured  values),  but  the  parameters  are  not  clearly  recognizable.  The 
dichotomization  approach  predicts  the  data  as  a  categorical  variable,  and  produces  parameter  estimates  that 
can  be  used  as  initial  estimates  for  estimating  the  parameters  of  the  proportional-odds  model;  however,  this 
method  is  slower  than  the  continuous  approach. 

We  propose  using  the  continuous  method  to  estimate  several  models  structures,  (i.e.,  different  levels 
and  sources  of  random  effects,  different  predictors,  and  so  on),  particularly  at  the  initial  stages  of  the  model 
development  process.  Using  model  validation  criteria  such  as  the  Akaike  Information  Criterion  (AIC)  (Bum- 
ham  and  Anderson,  1998),  several  model  strictures  can  be  investigated  quickly,  and  model  structures  that  are 
clearly  inappropriate  can  be  rejected  from  further  consideration. 

Using  the  model  structures  of  interest  from  this  initial  analysis,  the  dichotomization  approach  can  be 
used  to  estimate  approximate  models  for  the  ordinal  data.  Based  on  the  accuracy  of  these  models,  once  again 
using  metrics  such  as  the  AIC,  the  candidate  set  of  model  structures  may  be  reduced  again,  and  the  remaining 
structures  that  pass  this  “second  cut”  can  be  estimated  using  MLE  and  computationally  intensive  methods 
for  numerical  integration,  e.g.,  Laplacian  integration  or  Adaptive  Gauss-Hermite  Quadrature.  Furthermore, 
the  parameter  estimates  from  this  dichotomization  method  can  be  used  as  the  initial  estimate  for  the  more 
numerically  intensive  methods. 
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2.6.5  Simulation  Example 

This  simulation  represents  the  kind  of  data  that  might  appear  in  a  typical  survey*  The  response  variable  is 
the  performance  on  a  test,  and  may  take  on  three  categories:  poor,  fair,  or  good,  which  are  enumerated  as  0 
(poor),  1  (fair),  and  2  (good).  The  predictor  variables  for  the  model  are  experience ,  a  continuous  variable  that 
measures  the  years  of  experience  on  the  job,  and  treatment ,  a  binary  variable  that  corresponds  to  whether  a 
subject  has  received  a  particular  kind  of  training.  The  data  are  collected  for  every  soldier  within  one  regiment, 
and  the  grouping  for  mixed-effects  corresponds  to  the  nested  levels  of  company,  platoon  and  squad.  For  the 
simulation,  the  data  are  balanced  and  there  are  4  companies  in  the  regiment,  2  platoons  in  each  company,  2 
squads  in  each  platoon,  and  10  soldiers  (individuals)  in  each  squad. 

The  response  data  were  simulated  using  a  proportional-odds  model.  Several  proportional-odds  models 
were  used  to  examine  different  model  structures  for  the  random-effects  terms  only.  Denote  the  random- 
effects  terms  as  tjran  and  the  fixed-effects  terms  at  so  that  the  continuous  variable,  r\  in  (21),  can  be 
written  as: 

V  =  Vfix  +  V  ran 

The  fixed-effects  for  each  simulation  were: 


Vfix  —  A)  "F  Pexp%  exp  "I"  0trt3-trt 


(27) 


Results  were  calculated  for  the  following  four  random  effects: 


Vran  —  \ 


h  +  hj  +  bijtk 

b%  +  bij  +  bijk  +  bexPti 

h  +  hj  +  bijyk  +  btri,i 


(  bx.  +  bij  +  &i , 4"  bexp'i  +  btrt,i 


“Intercept” 

“Experience” 

“Treatment” 

“Both” 


(28) 


where  index  i  corresponds  to  the  company-level  effect,  j  is  for  the  platoon-level  effect  (i,e.,  the  jth  platoon 
in  the  if'!l  company,  and  k  is  for  the  squad-level  effect.  Each  of  the  random-effect  coefficients  are  normally 
distributed  random  variables  with  variance,  of  =  1. 

The  design  matrix  was  generated  by  specifying  xexp  as  a  uniformly  distributed  random  variable  between 
0  and  5  years,  and  the  treatments  were  calculated  from  a  binomial  distribution,  with  uniform  probability,  p  = 
0.5,  for  all  of  the  individuals.  Using  30  Monte  Carlo  trials,  results  were  calculated  for  the  four  simulations 
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Table  2:  Average  AAIC  values  for  the  continuous  models. 


Simulation 

None 

Intercept 

Model  Structure 
Experience 

Treatment 

Both 

Intercept 

231 

1.2 

4.3 

2.0 

6.8 

Experience 

307 

69 

1.4 

71 

2.9 

Treatment 

243 

34 

37 

0.29 

5.1 

Both 

318 

72 

17 

59 

0.26 

in  (28), 

Following  the  proposed  approach,  the  ordinal  response  data  were  first  modeled  with  continuous  linear 
mixed-effects  models.  For  each  model  structure  in  (28),  30  Monte  Carlo  trials  were  generated,  and  for  each 
trial,  five  model  structures  were  estimated;  four  are  the  exact  structures  in  (28)  and  the  fifth  model  was  a 
simple  linear  model  (i,e.,  mixed  effects  were  ignored).  For  each  model,  the  AIC  values  were  calculated. 
As  discussed  by  Burnham  and  Anderson  (1998),  the  AIC  metric  can  be  used  to  select  the  most  appropriate 
model  (or  subset  of  models)  from  a  candidate  set  of  model  structures.  Furthermore,  the  AIC  is  not  an  absolute 
measure,  and  should  only  be  used  relative  to  other  models  in  the  candidate  set.  Thus,  for  each  trial,  the  AAIC 
was  calculated  for  each  model  in  the  candidate  set, 

AAIC(j)  =  AIC(j)  -  min  AIC 

where  min  AIC  is  smallest  AIC  for  the  candidate  set  of  models.  The  AAIC  results  for  the  continuous  models 
are  shown  in  Table  2,  avenged  over  30  trials.  For  each  trial,  AAIC  =  0  for  the  selected  model;  averaged  over 
30  trials,  the  best  model  selection  for  these  simulations  has  the  smallest  average  AAIC. 

For  each  simulation,  the  smallest  average  AAIC  value  corresponds  to  the  true  model  structure.  Clearly, 
mixed-effects  models  are  required  as  the  AAIC  for  models  without  random  effects  (the  column  labelled 
“None”)  are  quite  large.  Often  the  AAIC  are  rather  close  (e.g.,  the  “Intercept”  simulations),  indicating  that 
no  model  is  clearly  superior;  although  for  the  “Both”  simulation,  the  AAIC  indicates  that  random  effects 
should  be  included  for  the  experience  and  treatment  slopes.  We  should  note  that  these  simulations  are  very 
fast;  thus,  using  this  continuous  modeling  step,  we  could  quickly  identify  reasonable  model  structures  for 
further  analysis. 

Using  the  proposed  dichotomization  method,  GLMMs  were  estimated  for  the  binary  data.  For  each 
simulation  in  (28),  the  glmmPQL  function  was  used  to  estimate  a  GLMM,  using  the  true  model  structure. 
The  results  from  one  trial  are  shown  in  Table  3,  for  the  “treatment”  simulation  (i.e.,  when  there  are  company- 
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Table  3:  Parameter  Estimates  for  “Treatment”  Simulation  (first  Monte  Carlo  trial) 


Observed 

Predicted 

0  1  2 

0 

156 

47 

7 

1 

13 

57 

48 

2 

0 

3 

29 

Table  4:  Parameter  Estimates  for  “Treatment”  Simulation  (averaged  over  30  trials) 


True 

Estimate  (Average) 

Estimate  (Std.  Dev.) 

Pexp 

0.1 

0.1489 

0.0956 

0trt 

1 

1.829 

0.939 

0i 

1 

1.03 

1.01 

62 

4 

4.72 

1.14 

^(Company, Intercept) 

1 

1.36 

1.25 

^(Company, Treatment) 

1 

1.765 

0.999 

^(Platoon, Intercept) 

1 

1.606 

0.617 

cr&(Squad,  Intercept) 

1 

1.83 

0.37 

level  effects  on  the  intercept  and  slope  for  xtrt).  The  model  predictions  are  relatively  accurate,  especially  for 
the  first  and  third  categories;  the  middle  category  gets  underpredicted.  For  all  of  the  simulations  in  (28),  we 
observed  the  same  behavior,  i.e.,  the  model  predictions  are  worst  for  the  middle  category. 

In  Table  4,  the  true  parameters  are  shown  with  the  estimated  parameters’  means  and  standard  deviations 
(for  30  Monte  Carlo  trials).  Note  that  there  is  a  rather  large  uncertainty  in  the  parameter  estimates,  and  that 
the  mean  values  for  the  estimates  are  greater  than  the  true  values.  This  behavior  also  was  observed  for  all  of 
the  simulations,  and  may  be  symptomatic  of  the  current  version  of  glmmPQL,  as  well  as  the  PQL  solution. 
However,  in  general  the  parameters  are  reasonably  close  to  the  true  values,  and  therefore  appear  to  be  useful 
as  initial  estimates  for  more  refined  numerical  methods  such  as  Laplacian  integration  and  Gauss-Hermite 
Quadrature. 


2.6.6  Results  for  an  experimental  data  set 

Data  collected  from  a  survey  of  ROTC  college  students  were  used  to  evaluate  the  proposed  dichotomization 
approach1 .  In  the  survey,  each  student  was  scored  according  to  three  metrics  related  to  leadership  abilities: 
performance  potential,  group  conflict,  and  psychological  health.  Each  of  these  of  metrics  could  take  on 
three  categories  (enumerated  as  0,  1  and  2).  For  performance  potential  and  group  conflict,  the  categories 
’The  ROTC  data  set  used  in  this  section  was  graciously  provided  by  Major  Paul  Bliese 
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Table  5:  Results  for  continuous  models  of  performance  potential 


Vrand 

A(AIC) 

GPA  (Squad) 

&GPA,s»GPA,i  +  bs 

0 

Gender  (Squad) 

^Gender, s  ^Gender,  i  “P 

2.6 

Gender,  GPA  (Squad) 

&Gender,s®Gender,i  +  ^GPA,s^GPA,i  +  bs 

4.6 

Intercept  (Squad) 

bs 

22.5 

None 

0 

330 

correspond  to  binning  the  students’  scores  into  three  classes:  the  upper  third  (2),  middle  third  (1),  and  lower 
third  (0).  Psychological  health  measures  the  degree  of  depression:  major  (0),  minor  (1),  and  none  (2). 

In  addition  to  the  leadership  metrics,  the  following  data  were  also  collected  for  each  subject:  high  school 
grade  point  average  (GPA),  gender,  efficacy  (a  measure  of  self  confidence),  engagement  (a  measure  of  in¬ 
volvement),  and  adaptability  (a  measure  of  “experimental  adaptability”). 

The  survey  was  conducted  for  eleven  regiments  and  the  regiments  were  grouped  according  to  companies, 
platoons,  and  squads  (two  companies  per  regiment,  four  platoons  per  company,  and  four  squads  per  platoon). 
The  data  set  consists  of  2,990  observations,  although  several  observations  are  incomplete.  After  removing 
these  values,  2,316  observations  remain,  and  the  resulting  data  set  is  unbalanced  -  i,e,,  there  are  varying 
numbers  of  observations  for  the  each  level.  For  small  data  sets,  unbalanced  data  can  be  problematic  because 
there  may  be  very  few  data  points  for  certain  levels.  Furthermore,  even  in  large  data  sets  where  there  are 
several  groups,  there  may  be  very  few  data  for  the  lowest  level  of  grouping.  In  this  case,  there  appear  to  be 
sufficient  observations  for  most  levels,  although  the  squad-level  interactions  should  be  handled  with  care, 
and  the  unbalanced  design  should  not  be  a  problem. 

Results  for  Performance  Potential  Using  Performance  Potential  as  the  response  variable,  we  examined 
several  model  structures  by  first  modeling  the  data  as  if  it  were  continuous.  The  linear  models  indicated 
that  all  of  the  predictors  are  significant,  and  should  be  included.  In  Table  5  the  difference  in  AIC  values 
for  several  models  are  shown.  Clearly,  random  effects  are  needed  to  model  the  data  accurately.  Among 
the  mixed-effects  models,  the  results  indicated  that  effects  may  have  existed  at  the  squad  and  regiment 
levels.  In  Table  5,  results  are  shown  only  at  the  squad  level  for  four  mixed-effects  models.  It  appears  that 
two-dimensional  models  may  be  necessary,  i.e.,  because  the  A(AIC)  for  models  with  mixed-effects  on  the 
intercept  only  was  22,  which  is  quite  different  from  models  that  included  random  coefficients  for  the  slopes. 
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Table  6:  Parameter  estimates  for  fixed  effects,  performance  potential 


$2 

Gender 

GPA 

Efficacy 

Engagement 

Adaptability 

PQL 

-6.85 

-8.42 

-.482 

0.769 

0.303 

0.369 

0.653 

POLR 

-6.66 

-8.18 

-0.481 

0.751 

0,300 

0.383 

0.602 

Table  7:  Parameter  estimates  for  random  effects,  peformance  potential 

_ abr  Q~GPA  ^Gender 

Estimate  1.906  0.673  0.928 

Component  (%)  73  10  17 


For  the  GLMM,  the  following  model  structure  for  the  fixed  effects  was  used: 


Vfix,i  —  ft  X% 


(29) 


Xi  ®GPA,i  ^Gender,?  ^Efficacy,*  ^Engagement,?  ^Adaptability,?  Il,i  ^2 ,i 

ft  [  t^GPA  /^Gender  /^Efficacy  /^Engagement  /^Adaptability  $1  $2  j 

where  I\  and  I2  are  the  indicator  variables  and  6\  and  $2  are  the  corresponding  thresholds. 

In  Table  8,  the  accuracy  of  the  POLR  and  PQL  model  predictions  is  shown,  using  an  “error  matrix”. 
The  true  category  is  shown  in  each  row  of  the  matrix,  and  the  model  predictions  are  shown  in  each  column. 
Thus,  when  the  observed  category  is  0,  the  PQL  model  predicts  Category  0  for  60%,  Category  1  for  24%, 
and  Category  2  for  16%  of  these  observations.  Ideally,  the  error  matrix  is  the  identity,  indicating  that  all 
categories  are  predicted  perfectly.  For  the  PQL  model,  the  diagonal  elements  of  the  error  matrix  are  all 
higher  than  the  corresponding  elements  for  the  POLR  model’s  error  matrix  (i.e.,  the  PQL  model  is  more 
accurate  than  the  POLR  model).  One  metric  for  measuring  the  model  accuracy  is  the  “error  rate”,  which 
equals  one  minus  the  average  of  the  diagonal  elements.  For  the  PQL  model,  the  error  rate  is  48%  versus  the 


Table  8:  Error  matrices  for  performance  potential. 

PQL  POLR 

Predicted  Predicted 


Observed 

0 

1 

2 

0 

1 

2 

0 

0.60 

0.24 

0.16 

0.53 

0.21 

0.26 

1 

0.30 

0.31 

0.39 

0.33 

0.24 

0.43 

2 

0.12 

0.23 

0.65 

0.21 

0.19 

0.60 
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Table  9:  Error  matrices  for  group  conflict. 

PQL  POLR 

Predicted  Predicted 


Observed 

0 

1 

2 


0  1  2 
0.91  0  0.09 

0.63  0  0.37 
0.28  0  0.72 


0  1  2 
0.93  0  0.07 

0.93  0  0.07 

0.88  0  0.12 


PQL  model  includes  squad-level  random  effects  on  the  intercept  and  the  Adaptability  slope. 


55%  for  the  POLR  model.  Thus,  the  addition  of  random  effects  clearly  improves  the  predictability  of  these 
data.  Note  that  for  both  the  POLR  and  PQL  results,  the  middle  category  (i.e.,  the  middle  third  of  performance 
potential)  is  predicted  the  least  accurately;  a  possible  explanation  is  the  thresholds,  8\  and  82  are  rather  close 
to  one  another  compared  to  the  spread  in  the  data. 


Results  for  Group  Conflict  and  Psychological  Health  The  same  modeling  approach  was  applied  using 
the  group  conflict  and  psychological  health  response  variables.  Several  models  with  different  dimensions 
and  groupings  of  random  effects  were  examined,  with  the  continuous  methods  and  proposed  diehotomization 
method.  For  both  response  variables,  only  squad-level  random  effects  appeared  to  be  significant.  The  error 
matrices  for  a  POLR  model  and  a  representative  mixed-effects  model  (estimated  using  PQL)  are  shown  in 
Tables  9  and  10,  respectively. 

For  group  conflict,  neither  the  POLR  or  PQL  model  ever  predicts  the  middle  category.  This  may  be 
the  result  of  two  factors:  the  difficulty  of  predicting  the  middle  category  (i.e.,  the  predictions  tend  to  be 
skewed  toward  the  outer  categories)  and  the  small  number  of  observations  in  the  middle  category  (24%  of  the 
observations  are  in  the  middle  category,  compared  to  47%  in  the  first  category  and  29%  in  the  last  category). 
For  psychological  health,  both  models  nearly  always  predict  the  last  category  (no  signs  of  depression)  and 
therefore  do  not  accurately  model  observations  in  the  first  or  second  category.  This  tendency  is  most  likely 
due  to  the  lack  of  data  for  signs  of  depression:  82%  of  the  respondents  reported  no  signs  of  depression,  14% 
reported  minor  depression,  and  only  4%  reported  major  depression. 

Clearly,  the  mixed-effects  models  estimated  with  PQL  do  not  predict  the  categories  where  there  are 
relatively  few  data  very  accurately.  In  Phase  II,  we  will  investigate  whether  a  more  accurate  mixed-effects 
model  can  be  estimated  using  a  more  numerically  intensive  method  such  as  Gauss-Hermite  Quadrature  (i.e., 
whether  such  a  model  accurately  predicts  categories  where  there  are  few  observations). 
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Table  10:  Error  matrices  for  psychological  health. 


PQL  POLR 

Predicted  Predicted 

Observed  0  12  0  12 

6  (hOl  0.10  0.89  0.01  0.09  0.90 

1  0.01  0.02  0.97  0.01  0.03  0.96 

2  0  0.01  0,99  1  0  0,01  0.99 

PQL  model  includes  squad-level  random  effects  on  the  intercept  and  the  Efficacy  and  Engagement  slopes. 
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3  Key  Research  Accomplishments 

•  Re-implemented  the  Ime  function  in  the  R  NLME  package. 

•  Developed  analytic  results  for  gradient  calculation  used  in  the  lme  function. 

•  Developed  a  method  for  modeling  ordinal  data  approximately. 

•  Demonstrated  the  method  for  ordinal  data  on  simulations  and  ROTC  data  set. 

4  Reportable  Outcomes 

•  Technical  report:  “Computational  Methods  for  Single  Level  Linear  Mixed-effects  Models”  (DebRoy 
and  Bates,  2003b)  (Appendix  A). 

•  Technical  report:  “Computational  Methods  for  Multiple  Level  Linear  Mixed-effects  Models”  (DebRoy 
and  Bates,  2003a)  (Appendix  B). 

•  Presentation:  “Converting  a  large  R  package  to  S4  classes  and  methods”,  to  be  presented  at  the  Dis¬ 
tributed  Statistical  Conference,  March  20-22, 2003,  Vienna  Austria  (Appendix  C). 

•  We  intend  to  publish  a  paper  on  the  proposed  approach  for  modeling  ordinal  data. 

•  The  analytical  results  will  appear  in  Mr.  Saikat  Debroy’s  Ph.D.  dissertation  (degree  expected  to  be 
awarded  Fall,  2003). 

5  Conclusions 

A  prototype  R  package  for  estimating  generalized  linear  mixed-effects  models  has  been  developed  in  Phase 
I  of  this  research.  A  key  function  within  the  package  is  the  lme  function.  In  Phase  I,  the  lme  function 
was  reimplemented  and  is  now  faster  and  more  accurate.  Several  analytical  results  were  developed  that, 
once  implemented  in  the  lme  function,  will  also  improve  the  speed  and  accuracy  of  the  package.  Using 
the  proposed  dichotomization  method,  approximate  GLMMs  may  be  estimated  for  ordinal  data.  The  results 
for  a  simulation  study  and  real  data  sets  indicated  that  these  approximate  models  can  predict  ordinal  data 
reasonably  accurately,  but  tend  to  be  skewed  toward  the  first  and  last  categories  and  do  not  accurately  predict 
in  categories  where  there  are  few  observations.  A  more  refined  numerical  procedure  for  estimating  GLMM 
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parameters,  which  will  be  investigated  and  developed  in  Phase  II,  may  be  needed  to  model  these  situations 
better. 
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Abstract 

Linear  mixed-effects  models  are  an  important  class  of  statistical  models 
that  are  used  directly  in  many  fields  of  applications  and  are  also  used 
as  iterative  steps  in  fitting  other  types  of  mixed-effects  models,  such  as 
generalized  linear  mixed  models.  The  parameters  in  these  models  are 
typically  estimated  by  maximum  likelihood  (ML)  or  restricted  maximum 
likelihood  (REML),  In  general  there  is  no  closed  form  solution  for  these 
estimates  and  they  must  be  determined  by  iterative  algorithms  such  as 
the  EM  algorithm  or  Fisher  scoring  or  by  general  nonlinear  optimizers. 
We  recommend  using  a  moderate  number  of  EM  iterations  followed  by 
general  nonlinear  optimization  of  a  profiled  log-likelihood.  In  this  paper 
we  present  a  method  of  calculating  analytic  gradients  of  the  profiled  log- 
likelihood.  This  gradient  calculation  can  be  implemented  very  efficiently 
using  matrix  decompositions  as  is  done  in  the  nlme  packages  for  R  and 
S-PLUS,  Furthermore,  the  same  type  of  calculation  as  is  used  to  evaluate 
the  gradient  of  the  profiled  log-likelihood  can  be  used  to  implment  an 
ECME  algorithm. 


1  Introduction 

Linear  mixed-effects  (LME)  models  are  widely-used  statistical  models  that  also 
are  used  as  iterative  steps  in  fitting  other  types  of  mixed-effects  models  such 
as  non-linear  mixed-effects  (NLME)  models  and  generalized  linear  mixed  mod¬ 
els  (GLMMs).  Some  forms  of  nonparametric  smoothing  spline  models  can  also 
be  viewed  as  LME  models  (Ke  and  Wang,  2001). 

*This  work  is  supported  by  U.S.  Army  Medical  Research  and  Materiel  Command  under 
Contract  No.  DAMD 17-02-C-01 19.  The  views,  opinions  and/or  findings  contained  in  this 
report  are  those  of  the  authors  and  should  not  be  construed  as  an  official  Department  of  the 
Army  position,  policy  or  decision  unless  so  designated  by  other  documentation. 


The  parameters  in  these  models  are  typically  estimated  by  maximum  likeli¬ 
hood  (ML)  or  restricted  maximum  likelihood  (EEML).  In  this  paper  we  consider 
ML  and  EEML  estimation  of  the  parameters  in  LME  models  with  a  single  level 
of  random  effects.  In  a  companion  paper  (DebRoy  and  Bates,  2003)  we  gener¬ 
alize  our  results  to  models  with  multiple  nested  levels  of  random  effects. 

Laird  and  Ware  (1982)  wrote  the  single-level  linear  mixed-effects  model  as 

Vi  =  X4 3  4-  Zibi  +  bi  ~  Af(G,  <72#-1),  e*  ~  AT(G,  a2 1),  i  =  1, . . . ,  m, 
-L  ?  bj_  _L  fcj ,  %  ^  -L  bj ,  all  i,  j 

(A-l) 

where  yi  is  the  vector  of  length  n*  of  responses  for  subject  i;  Xi  is  the  ti*  x  p 
model  matrix  for  subject  i  and  the  fixed  effects  /3;  and  Z,  is  the  n*  x  q  model 
matrix  for  subject  i  and  the  random  effects  bt.  The  symbol  ±  indicates  indepen¬ 
dence  of  random  variables.  The  columns  of  the  model  matrices  X%  and  Zi  are 
derived  from  covariates  observed  for  subject  i.  The  maximum  likelihood  esti¬ 
mates  for  model  (A-l)  are  those  parameter  values  that  maximize  the  likelihood 
or,  equivalently,  maximize  the  log-likelihood,  of  the  statistical  model  given  the 
observed  data, 

1.1  Log-Likelihood  Function 

When  defining  the  parameter  space  for  the  model  (A-l)  we  must  take  into 
account  that  the  matrix  #  is  required  to  be  symmetric  and  positive  definite. 
Instead  of  using  elements  of  #  as  parameters  we  will  use  0,  which  is  any  set 
of  non-redundant,  unconstrained  parameters  that  determine  #  (some  choices 
for  the  mapping  are  given  in  Pinheiro  and  Bates  (1996))  and  write  the 

log-likelihood  for  model  (A-l)  as 

1  r  m 

i (/3,cr2, 0\y)  =  --  jnlog(27rer2)  -  mlog  |$|  +  ^log \Z[Zi  +  $|] 

i—  1 

1  m 

Xitfil -  ZiiZ'iZi  +  #)-12/)(yj  - Xifi)  (A-2) 

where  y  is  the  concatenation  of  the  yi%i  =  and  n  =  n%  ls  the 

total  number  of  observations. 

It  is  straightforward  to  determine  the  conditional  ML  estimates  /3ml  (0)  and 
and  the  conditional  modes  of  the  random  effects  6*0, 0),  i  —  1, . . . ,  m 
given  0  as 

(m  \  _1  m 

(A-3) 

i=l  /  i=  1 

~  TZi  (vi  -  xSude))^  (Vt  -  xSml(0)) 

<t2ml(B)  = - - - !— - i - L 

n 

bt(P,  0)  =  (Z'Zf  +  $)-1Z'(yl  -  Xi/3) 


(A-4) 

(A-5) 


where 


Si  =  (/  -  ZtiZlZi  +  (A-6) 

1 , 2  Log-Restricted-Likelihood  function 

The  restricted  maximum  likelihood  (REML)  estimate  can  be  obtained  by  max¬ 
imizing 

LR(a2,6\y)  =  J  L(/3,  cr2,  B\y)df5 

where  L(f3,o2y0\y)  is  the  likelihood  function  for  single-level  LME  model.  We 
are  more  interested  in  the  log- restricted-likelihood ,  which  is 

1  r  m 

(<t2, 9\y)  =  -  2  [(n  -  p)  log(2?r a2)  -  mlog  |$|  4-  ]T  log \Z\Zi  +  #| 

i=l 

m 

+  £loglX'S,r1Xi| 

i=l 

1  m  t 

+  ^  E  (»«  -  xiPMi(ej)  S -1  (y*  -  Jr,j§j«(0))]  (A-7) 

i=l 

Because  is  not  a  function  of  {3  there  is  no  “REML  estimate”  of  (3.  It  is, 
however,  traditional  to  use  the  conditional  ML  estimate  (A-3)  evaluated  at  the 
REML  estimate  of  6,  We  can  compute  the  conditional  REML  estimate  of  a2  as 

~  TZi  fa  ~  x^MLie))^-1  (Vi  -  xSml(0)) 

&2reml{&)  — -  (A-8) 


1.3  Estimation  of  6 

Results  (A-3)  and  (A-4)  show  that  we  can  provide  analytic  expressions  to  evalu¬ 
ate  the  profiled  log-likelihood  1(0)  —  £((3ml(Q),  &2ml(&)i  #),  which  is  a  function 
of  6  only.  Using  (A-8)  we  can  also  calculate  the  profiled  log-restricted-likelihood 
£r($)  =  £r(ct2reml(6)‘,  #).  Note  that  although  (A-3)  and  (A-4)  show  that  we 
can  evaluate  a  profiled  log-likelihood,  they  are  not  good  computational  formulas 
for  doing  so.  Efficient  computational  methods  for  evaluating  1(0)  and  IR(0)  are 
described  in  Pinheiro  and  Bates  (2000)  and  summarized  in  §2.3. 

It  is  generally  much  easier  to  determine  the  ML  (or  REML)  estimates  by 
optimizing  £  (or  IR)  rather  than  £  (or  £R)  when  using  a  general  optimization  rou¬ 
tine  because  the  parameter  vector  is  of  smaller  dimension  -  often  much  smaller. 
Another  way  we  can  make  the  optimization  problem  easier  is  by  providing  an¬ 
alytic  gradients  of  the  profiled  log-(restricted)-likelihood.  It  is  possible  to  use 
numerical  gradients  in  optimization  but  analytic  gradients  are  preferred  as  they 


provide  more  stable  and  reliable  convergence.  In  section  2  we  derive  the  ana¬ 
lytic  gradient  of  the  profiled  iog-(restricted)-likelihood  and  show  how  it  can  be 
computed  accurately  and  efficiently. 

A  third  way  to  make  the  optimization  problem  easier  is  to  use  good  starting 
values.  The  EM  algorithm  provides  a  way  of  refining  starting  estimates  before 
beginning  general  optimization  procedures.  In  section  3  we  derive  EM  and 
ECME  iterations  for  ML  (REML)  estimates  of  model  (A-l)  and  show  how  these 
updates  may  be  applied  efficiently. 

Finally  in  section  5  we  discuss  some  connections  between  the  gradient  result 
and  the  ECME  iterations. 


2  The  gradient  of  the  profiled  log-likelihood 


Because  the  partial  derivatives  of  t  with  respect  to  0  and  a2  at  the  conditional 
estimates  and  <j2ml(B)  must  be  zero,  the  first  two  terms  in  the  chain 

rule  expansion  of  the  gradient 


d£ 

dB 


d/3f  d£ 
d0  3(3 


d£  da2 
+  da2  did 


d&ij 

~W 


will  vanish,  leaving  only 


M 

dB 


(A-9) 


When  evaluating  (A-9)  we  must  again  take  into  account  the  symmetry  of 
We  define  a  special  form  of  the  derivative  of  a  scalar  function  with  respect  to  a 
symmetric  matrix  for  this. 


2.1  Derivatives  for  symmetric  matrices 

Let  /{#)  be  a  scalar  function  of  the  q  x  q  symmetric  matrix  #.  We  will  define 
the  symmetric  matrix  derivative  of  /(#)  with  respect  to  to  have  elements 


2  diij  f  (^)  i  ^  i 


(A-10) 


This  definition  has  the  property  that  if  $  is  a  function  of  a  non-redundant 
parameter,  such  as  B ,  then 


dB 


EE 

j= i 


d*g{e) 

dB 


(A-ll) 


with  the  sum  in  (A-ll)  being  over  all  the  elements  of  not  just  the  upper  or 
the  lower  triangle. 


Another  way  of  regarding  definition  (A- 10)  is  to  consider  Y,  a  general  q  x  q 
matrix  (i.e.  not  restricted  to  being  symmetric)  that  happens  to  coincide  with 
Then 

d{$}1{  >  ~  2 

That  is,  the  derivative  /(#)  can  be  obtained  by  differentiating  the  scalar 
function  with  respect  to  #  disregarding  the  symmetry,  then  symmetrizing  the 
result. 

Using  this  approach  and  standard  results  on  matrix  derivatives  (Rogers, 
1980)  we  have  the  following  results  for  #  and  A  symmetric  q  x  q  matrices  and 
«,  /3  vectors  of  dimension  q. 


d  (a'm=al3'  +  l3a' 

d{$}{  P>  2 

(A-12) 

(o'  (A  +  $)“1  /3)  =  -(A  +  $)~1  af3'  +  /3Q'  (A  +  $)-' 

(A- 13) 

d{<f>}tT ~  A 

(A-14) 

^\A  +  $\  =  \A  +  *l(A  +  *)-1 

(A- 15) 

^|ylog|A  +  *|  =  (A  +  *)-1 

(A-16) 

2,2  Gradient  of  the  profiled  log-likelihood 

Using  the  results  of  the  previous  section  w^e  can  express  the  partial  derivative 
of  the  log-likelihood(A-2)  with  respect  to  as 


dl  _  1  01og|*|  ,  ^01og|Z?Zi  +  *| 

a{#}  2  [  m  a{$}  a{$} 

1  m 

-  2^2  E  -  XiP)'(i  ~  z>(z&  +  *)~lzi')(vr  -  xtp) 

if  m 

L  i=l 

i  w 

-  2^  +  *rlz'i(Vi  -  Xi MVi  -  Xtp)'ZdZ[Z%  +  *)-» 


In  the  profiled  log-likelihood  the  last  term  can  be  expressed  using  the  conditional 
modes  (A-5)  providing 


d£  f  d£  1 


-EE  HE 

i— 1  3=1  K  i=l 


(A-17) 


2,3  Efficient  computation  of  the  gradient 

Pinheiro  and  Bates  (2000)  describe  methods  for  evaluating  the  profiled  log- 
likelihood  using  a  relative  precision  factor  A,  which  is  any  qxq  matrix  satisfying 
A'  A  =  4>_1,  and  a  series  of  QE  decompositions  of  the  form 


J  0 


!0(i)  A? 

RoO(j)  j  0 


ci(i)  _  r%t  IVj 


followed  by 


«6U)‘ 

RqO  Co  1 

—  Qo 

0 

C- 1 

Co(m)_ 

0 

0 

(A- 18) 


(A-19) 


These  decompositions  provide  easily  solved  triangular  systems  of  equations 

Roo0(6)  =  Co  (A-20) 

=  Ci(i)  -  Rio{i)$(6)  i  =  1, . . . ,  m  (A-21) 

for  the  conditional  estimates  of  f3  and  the  conditional  modes  of  the  random 
effects.  The  conditional  estimate  of  a2  is  a2(&)  —  cti/n  and  {Z[Zi  +  3>)~1  — 
■^li  (*)^i  i(0* 

Using  these  results  and  taking  one  more  QR  decomposition 
[&1  Mhl&ML  (^n{l))  •**  bmMhl&ML  (^u(m))]  —  UA  (A-22) 


we  can  evaluate  the  gradient  of  the  profiled  log-likelihood  as 

2—1  j—1 


(A-23) 


where  (A1  A) ij  and  are  the  (i,  j)-th  elements  of  A1  A  and  #  1  respectively. 


2,4  Gradient  of  the  log-restricted-likelihood 

Using  the  same  argument  as  used  for  log- likelihood,  to  compute  the  gradient 
of  the  profiled  log-restricted-likelihood,  we  only  need  consider  the  the  partial 


derivative  of  the  log-restricted-likelihood  (A-7)  with  respect  to  $  which  can  be 
computed  to  be 


dta 

&m 


1 

2 

1  d 


gjogjjl  |  ^tflog|Z^  +  »| 


dm 


i=l 


-  2  5{#}5Z1o«  I X'i  i1  ~  zi (z'izi  +  S)'1 Z{ )  Xi\ 

1  ^  Q  t 

2a2  5Z  —  X^ml)  {l  —  Zt(ZtZt  4-  Zi  )  $i  ~  X% 


i=  1 

j  m  /  M  \ 

-  -  ^(Z'z,  +  #)-1Z|Xi  I  X'Zt(Z'Z,  +  #)-> 

-  2^2  -  xSML)(Vi  -  xSuLyZiiziZi  +  *)- 

X  — 1 


Using  (A-5)  and  the  decompositions  in  §2.3,  the  profiled  log-restricted- 
likelihood  can  be  written  as 

dl  _  ^  f  din  1  d$rj 

de  kh\ <«•>;„  * 


=1  j= 
Q  $ 


ij 

— *  — *  s=  1  ^ 

^KO^oCO^oo  t^oo)  ^io(0  (^ii(*>) 

-dHl 

J/y 


i=l  j=l 


(A-24) 


If  we  take  the  QE  decomposition 


brnMlf^REML 


/  ,  \  * 


-urar 


(A-25) 


we  can  compute  the  profiled  log-restricted-likelihood  as 


=  ^2((a'rar)h  -  y  W 

i=  1  i=i 

3  EM  and  ECME  algorithms  for  LME  models 

The  EM  algorithm  (Dempster  et  aL,  1977)  is  a  general  iterative  algorithm  for 
computing  maximum  likelihood  estimates  in  the  presence  of  missing  or  unob¬ 
served  data.  In  the  case  of  LME  models  the  typical  approach  to  using  the  EM 
algorithm  is  to  consider  the  6t, i  =  1, . . . ,m  as  unobserved  data.  In  the  termi¬ 
nology  of  EM  algorithm,  we  call  the  given  data  yi  the  incomplete  data  and  y* 
augmented  by  bi  the  complete  data. 

The  EM  algorithm  has  two  steps:  in  the  E  step,  we  compute  Q,  the  expected 
log-likelihood  for  the  complete  data  and  in  the  M  step  we  maximize  the  expected 
log-likelihood  with  respect  to  the  parameters  in  the  model. 

Liu  and  Eubin  (1994)  derived  the  EM  algorithm  for  LME  models  using  bi 
as  the  missing  data.  In  same  paper  they  also  introduce  expectation  conditional 
maximization  either  (ECME)  algorithms,  which  are  an  extension  of  the  EM 
algorithm.  In  ECME  algorithms  the  M  step  is  broken  down  into  a  number  of 
conditional  maximization  steps  and  in  each  conditional  maximization  step  either 
the  original  log-likelihood  £  or  its  conditional  expectation  Q  is  maximized.  The 
maximization  in  each  step  is  done  by  placing  constraints  on  the  parameters  in 
such  a  way  that  the  collection  of  all  the  maximization  steps  is  with  respect  to 
the  full  parameter  space. 

Liu  and  Rubin  (1994)  provide  an  example  of  an  ECME  algorithm  for  LME 
models.  An  alternative  approach  (van  Dyk,  2000)  uses  other  candidates  for  the 
unobserved  data  in  LME  models  but  we  will  not  pursue  that  here. 

3,1  An  EM  algorithm  for  LME  models 

We  first  describe  the  EM  algorithm  obtained  with  bi  as  the  missing  data.  We 
denote  the  current  values  of  the  parameters  by  /30,  Oq  and  0q,  which  generates 
#o-  These  are  either  starting  values  or  values  obtained  horn  the  last  E  and  M 
steps.  The  parameter  estimates  to  be  obtained  after  an  E  and  an  M  step  are 
ft,  <?\  and  0i,  which  generates  #i.  It  happens  that  in  the  EM  and  ECME  al¬ 
gorithms  we  can  derive  3>i  directly  without  forming  so  we  will  write  formulas 
in  terms  of 


The  log-likelihood  for  the  complete  data  is 
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(A-26) 


and  the  conditional  distribution  of  the  random  effects  is 

bi\yu  A),  4,  $0  ~  AT  (bi(f30,  «0),  (Z[Zi  +  #0)-1og)  (A-27) 

where  6j(/30,  <ho)  is  the  conditional  expected  value  of  (also  the  conditional 
mode)  given  in  (A-5). 

In  the  E  step  we  compute  Q(/3}<j2,  #|y5/30i^o?  ^o)?  the  conditional  expec¬ 
tation  of  £  (/3,  ex2, 3>|y,  6)  as  defined  in  equation  (A-26). 
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(A-28) 

In  the  M  step,  -we  maximize  the  expected  log-likelihood  defined  in  equation 
(A-28).  To  do  that,  we  differentiate  Q(/3,  a2,  #|y,  /30?  0q?  $o)  with  respect  to  /3? 
a-2  and  #. 
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We  use  (A-12),  (A-14)  and  (A- 15)  to  compute  the  matrix  derivative  of  (A-28) 
with  respect  to  4>. 
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Equating  to  the  zero  matrix  and  substituting  <j\  for  cr2 
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Finally  we  differentiate  (A-28)  with  respect  to  a2  to  get 
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Again  equating  to  zero  and  substituting  (3i  for  (3  and  4>i  for  #  we  get 
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We  can  now  substitute  the  value  of  $1  from  (A-31)  in  (A-32)  to  get 
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Using  the  fact  that  tr(A?)  —  tr  (Y^iLi  A)  we  can  show  that  the  first  two 
terms  together  are  equal  to  mqaf.  Therefore 

1  m 

*?  =  -£  {ll»(  -  z&  -  X*M2  + tr  [KZAiZlZi  +  So)"1] }  (A-33) 


3.2  Efficient  computation  of 

To  compute  #i  we  use  the  same  results  as  in  §2,3  except  that  the  conditional 
means  of  the  random  effects  are  scaled  by  <7q,  not  a (6).  That  is,  the  orthogonal- 
triangular  decomposition  we  use  is 

[hiM>  ***  bm/aQ  (-Rn(m))]  =U1Ai  (A-34) 

The  q  x  q  upper  triangular  matrix  A\  provides  as  (Aj"1)^ 


4  An  ECME  variation  to  the  EM  algorithm 

Consider  the  ECME  algorithm  for  estimating  the  LME  model 

1.  Maximize  Q  over  #  by  fixing  a2  to  and  (3  to  (3q  to  obtain  #i. 

2,  Maximize  £  over  /3  and  a 2  by  fixing  #  to  #i  to  obtain  j3\  and  3f . 

Note  that  in  the  second  maximization  step,  the  constraint  is  only  on  4>.  So,  3f 
and  j3\  are  same  as  the  ML  estimates  of  a2  and  f3  given  #  =  #i. 

The  matrix  A  we  will  use  to  compute  this  #i  is  exactly  the  triangular  factor 
from  (A-22)  and 

3,1/2  =  ypo  ^A-iy  =  yg IU-W 
U'o 

By  using  the  ECME  algorithm  we  avoid  computing  the  EM  estimate  of  a2 
at  each  step  and  instead  use  which  is  much  simpler  to  compute.  It  is  clear 


that  at  the  stationary  point,  the  two  updates  coincide  and  furthermore  this 
ECME  algorithm  behaves  at  least  as  well  as  the  EM  algorithm  because  using 
the  ML  estimate  for  a2  can  only  increase  the  value  of  the  log-likelihood.  The 
formal  justification  for  ECME  algorithm  is  a  more  rigorous  proof  of  this  heuristic 
argument. 


4,1  ECME  algorithm  for  REML 

To  obtain  a  REML  ECME  algorithm  we  consider  /3  to  be  part  of  the  missing 
data. 
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The  joint  conditional  distribution  of  /3  and  the  bj’s  is 
bt\0,  Vi ,*l  $0  ~  N  (Si,  (Z'Zi  +  #0 )_1og) 

P\Vi,  *1  *0  ~  V  Uml,  (e  x&7oX^j 


(A-35) 

(A-36) 

(A-37) 


where  Ei0  is  defined  by  (A- 6)  with  #  =  3>0* 

For  the  E  step  we  have  to  compute  Qr{<j2,  &\yya2,  #0)i  the  conditional 
expectation  of  £r. 

Qr(<72,*|J/,CTo.*0) 

=  Ep ly  [Eb^ylR(a2,  #|y)] 


=  E, 


f$\v 


n  +  mq  j  log  ],| 


i= 1  l 

m  (  i 

-E  ■ 

i=l  t 


Zibi(j3,  #0)  -  Xifif  tr  [Z’ZioUZ'Zi  +  #0)"1] 
2<T2  +  2cr2 


I *1/2MA  *o)||2  to  [*aUZjZi  +  *0)~ 

2<r2  2cr2 


We  can  now  specify  our  ECME  algorithm  as 
1,  Maximize  Qr  over  by  fixing  a2  to  5q  to  obtain  #!, 


(A-38) 


2,  Maximize  in  over  a2  by  fixing  #  to  #1  to  obtain  o\  —  g2reml * 

In  this  ECME  algorithm  we  only  need  to  maximize  Qr  with  respect  to 
So  we  only  have  to  compute  the  expectation  of  the  fifth  term  in  (A-38),  Other 
terms  are  either  free  of  #  or  they  are  constants.  Now 
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So  the  gradient  of  Qr  with  respect  to  is 
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We  equate  the  above  to  the  zero  matrix,  use  <r§  —  a2  =  cf2reml  and  the 
decomposition  (A-25)  to  get  the  EM  update  =  \fm  (A^1)/, 

5  Discussion 

Although  not  obvious  from  the  derivations  it  should  not  come  as  a  surprise  that 
the  gradient  and  of  the  ECME  update  for  ML  criteria  both  involve  the  same  qxq 
matrix  A.  If  the  EM  algorithm  has  converged  so  the  matrix  #  is  a  stationary 


point  then  A  =  m#”1/2  and  the  gradient  of  the  profiled  log-likelihood  is  zero, 
as  it  should  be.  The  same  also  holds  for  the  REML  criteria  with  Ar  =  m#-1/2 
at  the  stationary  point. 

The  ECME  algorithm  presented  in  §4  can  be  viewed  as  an  attempt  to  make 
the  gradient  zero  by  setting  #  to  be  the  matrix  that  would  make  the  gradient 
zero  if  A  does  not  change. 
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Abstract 

In  an  earlier  paper  we  provided  easily-calculated  expressions  for  the 
gradient  of  the  profiled  log-likelihood  and  log-restricted-likelihood  for 
single-level  mixed-effects  models.  We  also  showed  how  this  gradient 
is  related  to  the  update  of  an  ECME  (expectation  conditional  maxi¬ 
mization  either)  algorithm  for  such  single  level  models.  In  this  paper 
we  extend  those  results  to  mixed-effects  models  with  multiple  nested 
levels  of  random  effects. 


1  Introduction 

In  an  earlier  paper  (DebRoy  and  Bates,  2003)  we  derived  the  gradient  of  the 
profiled  log-likelihood  and  the  log-restrieted-likelihood  for  a  linear  mixed- 
effects  (LME)  model  with  a  single  level  of  random  effects.  In  the  Laird- 
Ware  (Laird  and  Ware,  1982)  formulation  this  model  would  be  written 

Vi  —  XiP  +  Zib%  +  bi  ~  ^/(0,<t2#  J), 

Si  ~  N{Q,o2I),  (B-l) 

Cj  JL  Sj ,  b%  _L  bjf  i  ^  jj  s%  _L  fej,  all 

*This  work  is  supported  by  U.S.  Army  Medical  Research  and  Materiel  Command  under 
Contract  No,  DAMD17-G2-C-0119.  The  views,  opinions  and/or  findings  contained  in  this 
report  are  those  of  the  authors  and  should  not  be  construed  as  an  official  Department  of 
the  Army  position,  policy  or  decision  unless  so  designated  by  other  documentation. 


where  yi  is  the  veetor  of  length  n*  of  responses  for  subject  i;  X*  is  the 
rii  x  p  model  matrix  for  subject  i  and  the  fixed  effects  /3;  and  Zi  is  the 
T%i  x  q  model  matrix  for  subject  i  and  the  random  effects  b{.  The  symbol 
JL  indicates  independence  of  random  variables.  The  columns  of  the  model 
matrices  X*  and  Z{  are  derived  from  covariates  observed  for  subject  i. 

The  final  computational  formulas  are  based  on  a  relative  precision  factor 
A ,  which  is  any  q  x  q  matrix  satisfying  AfA  —  and  a  series  of  QE 
decompositions  of  the  form 
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and 


***  bmMLl&ML  —  UA  (B-4) 


giving  the  gradient  of  the  profiled  log-likelihood  as 
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where  (AfA)ij  and  are  the  (i,  j)-th  elements  of  A1  A  and  #  1  respec¬ 
tively. 

For  the  log-restricted-likelihood ,  the  objective  function  optimized  by  the 
REML  estimates,  decomposition  (B-4)  is  replaced  by 


K„y 

VmML/°REML 
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(B-6) 


=  uRAR 


Furthermore  an  ECME  update  of  A  can  be  calculated  as 
qI/2  =  y/nm  (A-iy  =  ^ 

for  maximum  likelihood  (ML)  estimation  or  the  same  expression  with  Ar 
in  place  of  A  for  REML  estimation. 

In  this  paper  we  generalize  these  results  to  linear  mixed-effects  models 
with  multiple  nested  levels  of  random  effects. 

1*1  Multilevel  LME  models 
A  general  linear  mixed-effects  model  would  be  written 

y  =  X/3  +  Zb  +  £  (B-7) 

where  y  is  the  response,  X  and  Z  are  fixed  effects  and  random  effects  model 
matrices,  (3  is  the  vector  of  fixed  effects  parameters,  b  is  the  random  effects 
vector  and  €  is  the  error  vector  and  both  b  and  e  are  assumed  to  have 
Gaussian  distributions. 

We  will  consider  specific  forms  of  (B-7)  that  generalize  (B-l)  to  multiple 
nested  levels  of  random  effects.  Such  models  are  sometimes  called  multilevel 
models  (Goldstein,  1987)  or  hierarchical  linear  models  (Raudenbush  and 
Bryk,  2002).  Let  Q  be  the  number  of  nested  levels  of  random  effects.  We 
will  number  the  levels  so  that  the  Q-th  level  is  innermost.  That  is,  the  level-1 
random  effects  apply  to  the  largest  groups  of  experimental  units,  the  level-2 
random  effects  to  the  next  largest  groups  nested  within  the  largest  groups, 
and  so  on,  up  to  level  Q .  The  fixed  effects  j3  apply  to  all  the  observations 
which  could  be  considered  as  a  single  group  of  observations  at  a  hypothetical 
zeroth  level  of  grouping. 

Two  typical  examples  of  nested  classifications  of  experimental  units  re¬ 
sulting  in  such  multilevel  data  are  students  within  classes  within  schools 
within  school  districts  in  an  educational  system  or  soldiers  within  squads 
within  platoons  within  companies  within  regiments  in  an  army.  In  the  ed¬ 
ucational  system  we  would  call  the  school  districts  level  1,  the  schools  level 
2,  and  so  on.  In  the  multilevel  modelling  literature  the  levels  are  often 
numbered  in  the  opposite  order;  from  innermost  to  outermost. 

We  will  assume  that  the  observations  are  ordered  so  that  observations  in 
the  same  group  are  adjacent  for  all  Q  lewis  of  nested  classification.  At  the 
i- th  level  of  classification  we  will  number  the  groups  from  1  to  m* ,  the  total 
number  of  groups  at  that  level  and  write  indices  according  to  the  group 


number  and  the  level.  That  is,  we  will  write  the  level  i  response  vectors 
as  of  length  n^*)*  J  =  1, . , .  ,m*.  We  will  extend  this  notation  to  the 
hypothetical  level  0  for  which  mo  =  1  and  =  n,  the  total  number  of 
observations. 

The  model  matrix  for  the  fixed-effects  is  X  of  size  nxp.  When  discussing 
a  particular  level  of  groups  we  will  refer  to  the  submatrices  consisting  of  the 
rows  of  X  for  group  j  at  level  i  as  X^  of  size  x  p.  Recall  that  “level 
0”  corresponds  to  the  fixed-effects  which  apply  to  all  n  observations  so  that 


y  —  i/o(i)  and  x  =  xm. 

The  random  effects  at  level  fe,  6 =  h  *  *  *  ,rajfc  are  each  of  dimension 
Qk-  The  corresponding  model  matrices  are  of  size  x  q^.  At  times  we 
will  need  to  refer  to  the  rows  of  the  model  matrix  for  the  level  k  random 
effects  corresponding  to  group  j  at  level  i,  which  we  designate  as  The 

model  matrix  Z  for  the  complete  set  of  random  effects  ft  =  (b^ . . . ,  ftQ 
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where  each  Z\  is  defined  to  be 
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o  %ii(2)  0  . . .  0 


L  0  0  0  Zii(mi)  J 


Notice  that  the  model  matrix  Z  can  be  completely  defined  from  components 
of  the  form  Z u(j),  i  =  1, . . .  ?  Q,  j  =  1, . . . ,  m*.  The  notation  Z^q j  for  k  ^  i 
is  used  simply  to  designate  the  rows  corresponding  to  group  j  at  level  i  of 
the  model  matrix  for  the  level  k  random  effects. 

The  model  specification  is  completed  by  specifying  the  distribution  of 
the  random  effects  and  the  random  noise  £ 


e  ~  Af( 0,  a2 1)  fofe(i)  ~  Af( 0,  *) 

-L  bk(j)’  *  j  all  k,  el  bk^,  all  i,  k  bk^  ±  bk,^  k  ±  k',  all  i,j 

(B-8) 

where  _L  indicates  independence  of  two  random  variables.  The  relative  dis¬ 
persion  matrices  i  =  1, . . . ,  Q  are  gathered  as 


Q 

i= 1 


These  matrices  must  be  positive  definite,  symmetric  matrices.  We  use  0,  to 
represent  a  non-redundant,  unconstrained  parametrization  for  #1, . . . ,  #q. 

The  log-likelihood  for  y  is  given  by 

l  (p,o2,  0\y)  =  -i  [nlog(27T(T2)  +  log  (J  +  Z#^Z'| 

+  ^(y  -  XU)' {I  +  Z$>-AlZ')-\y  -  X/3)] 

Using  the  identities 

1 1  +  ZS^Z'l  = 

1  A  1  \*a\ 

(I  +  ZSf-^Z') ~l  =  I-  Z(Z'Z  +  #a)_1Z/ 
we  can  rewrite  the  log-likelihood  as 

l  (0,a2,e\y)  =  [nlog(27r<r2)  +  log  \Z'Z  +  *A|  -  log  |*A| 

+  ^(y  -  X/3)'(/  -  Z(Z'Z+^A)-1Z')(y  -  X/3)]  (B-9) 

For  a  given  value  of  9,  the  conditional  ML  estimates  of  (3  and  a 2  are 
given  by 

Pml(0)  =  SxX'  (I  -  Z(Z'Z  +  ^)"1Z/)  y  (B-10) 

^ml(0)  =  \{v-  XPmL)'{I  -  Z(Z'Z  +  $A)-lZ')  (y  -  X$ML)  (B-ll) 
where 

Sx  =  (x'  (J  -  Z(Z'Z  +  ^)-1Z/)_1  x)_1  (B-12) 

and  the  log-restricted-likelihood  is  given  by 

£r  {a3, 6\y)  =  [(n  -  p)  log(2 ■kcj2)  +  log  | Z'Z  +  -  log  |$4| 

+  log  |X'  (I  -  Z(Z'Z  +  #a)_1Z')  X| 

+  ±(y  -  X/3)' (J  -  Z(Z'Z  +  #a)_1Z0(»  -  X/3)]  (B-13) 


Because  £r  is  not  a  function  of  (3  there  is  no  conditional  “REML  esti¬ 
mate”  of  (3.  (It  is,  however,  traditional  to  use  the  conditional  ML  estimate 


(B-10)  evaluated  at  the  REML  estimate  of  6  as  the  final  estimate  of  f3.)  We 
can  compute  the  conditional  REML  estimate  of  a2  as 

(y-  X0ml)'(I  -  Z(Z'Z  +  #a)-1£')  (y  -  X0ML) 

<72reml(6)  =  ± ^ - >- 

n  —  p 

(B-14) 

As  we  can  obtain  analytical  expressions  for  the  ML  estimate  of  /3  and  the 
conditional  estimates  of  <72,  we  can  optimize  with  respect  to  0  only.  That 
is,  we  optimized  the  profiled  log-likelihood  1(6)  =  £{^ml(6)^c2  ml(6),  6)  or 
the  profiled  log-restricted-likelihood  £r(6)  —  £(/3ml(6), ^hEML($), 0), 


1.2  Efficient  computation  of  the  log-likelihood 

The  decomposition  methods  (B-2)  and  (B-3)  for  evaluating  the  profiled  log- 
likelihood  or  log-restricted-likelihood  were  described  for  both  single-  and 
multiple-level  LME  models  in  Pinheiro  and  Bates  (2000,  chap.  2).  In  De- 
bRoy  and  Bates  (2003)  we  showed  that  these  decompositions  followed  by 
(B-4)  can  be  used  to  calculate  the  gradient  of  the  profiled  objective  func¬ 
tion  and  to  provide  an  ECME  update.  Here  we  generalize  the  calculation 
of  the  profiled  log-likelihood  or  log-restricted-likelihood  to  multiple  levels  of 
random  effects.  Later  we  will  show  how  the  generalization  of  the  gradient 
and  ECME  calculations. 

Consider  the  Cholesky  factorization 
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where  R  is  upper  triangular.  Because  of  the  structure  of  Z  and  we  can 
partition  R  as 
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where  for  i  >  1  Ra  —  diag (jR^(i), . . . ,  each  Ra(j)  is  upper  triangular 

of  size  qi  x  q{.  Because  we  have  ordered  the  observations  so  that  the  groups 


axe  in  adjacent  rows,  we  can  ensure  that  for  i  >  1,  j  >  0 


&ij(  i,i)  0 


0 


R%j  — 


0  -®»i(my(1)+ 1,2) 

®  (!)  +m<;(2)  ,2) 


0 

0 


0 

0 


0 

&ij  (mi-rrii j  (fn. } + 1  ,mj  ) 
Rijimi.mj) 


where  Rij(ki)  is  a  general  matrix  of  size  qi  x  qj  and  is  the  number  of 

level-2  groups  which  are  nested  within  the  fc-th  level- j  group.  Also,  for  i>  1 
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where  e^*)  is  of  size  q\  and  Rio(j)  is  of  size  qi  x  p. 

Instead  of  computing  the  R  matrix  from  a  Cholesky  factorization,  we 
use  a  series  of  QR  decompositions  to  obtain  Ru^}  and  Rij(kt)-  When  Q  =  1 
these  decompositions  are  exactly  (B-2).  The  decompositions  for  Q  —  2  are 
shown  in  detail  in  Pinheiro  and  Bates  (2000,  chap.  2). 

In  terms  of  these  decomposed  matrices  the  profiled  log-likelihood  and 
log-restricted-likelihood  are 
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i=i  fe=i 
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+ 52  5Z  (lo«  I Ai! -  los  l^wl) 
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2  Gradient  of  the  profiled  log-likelihood 

Expression  (B-5)  for  single  level  LME  models  is  derived  in  DebRoy  and 
Bates  (2003)  For  multiple  levels  this  generalizes  to 


(B-15) 


where  denotes  the  partial  derivative  of  £  with  respect  to  the  symmetric 
matrix  as  defined  in  DebRoy  and  Bates  (2003). 

We  can  also  write  it  as  Ylk=i  a{ik)  *  ll^  w^ere  the  mxn  matrix  A  *  B 
is  the  star  product  (Rogers,  1980)  defined  by 


p  q 

[A  *  B}ij  =  EE  l)n+j 

k=  1  1=1 

with  A  of  size  p  x  q  and  B  of  size  mp  x  nq, 

The  partial  derivative  of  (B-9)  with  respect  to  is  given  by 
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In  order  to  compute  the  derivative  with  respect  to  we  need 
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where  =  0  if  i  ±  k  and  =  vec(Igj)  vec'(Igi). 

Let  (ZfZ  +  #a)^  he  the  block  in  #a)_1  that  corresponds  to  the 

Z'Zi  block  in  Z'Z  and  (Z'Z  +  &Af{k)  be  the  block  in  {Z'Z  +  QaT1  that 
corresponds  to  the  ZL^jZ^)  block  in  ZfZ, 

Using  (B-16)? 
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We  can  now  write  the  gradient  of  the  profiled  log-likelihood  as 
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2*1  Gradient  of  the  Log-Restrieted-Likelihood 

The  partial  derivative  of  (B-13)  with  respect  to  is  given  by 
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2*2  Efficient  calculation  of  the  gradients 

As  discussed  in  §1.2  we  do  not  use  the  analytical  formulas  derived  here 
directly  when  computing  the  gradients  numerically.  Instead  we  use  the 
decomposed  matrices. 


First  of  all,  for  i  >  j  define 
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Ti,Q 

Then  the  gradient  of  the  profiled  log-restricted-likelihood  is  given  by 
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3  ECME  Algorithm  for  linear  mixed  effects  mod¬ 
els 

DebRoy  and  Bates  (2003)  presented  an  ECME  algorithm  for  single-level 
LME  models*  Here  we  extend  that  algorithm  to  multiple  nested  levels  of 
random  effects* 

Using  b  as  the  missing  data*  the  log-likelihood  of  the  complete  data  is 
lr  Q 

e(P,a2,0\y,b)  =  --^(n  +  Y^mkqk)log(2Tra2) 

k=  l 

+  ±\\y-X/3-  Zbf  -  log  |#A|  +  ^b'$>Ab]  (B-22) 

In  subsequent  discussion,  we  will  assume  that  the  current  values  of  the 
parameters  are  /3o,  &q  and  Bq.  These  can  be  either  the  starting  values  or 
the  values  obtained  from  the  last  E  and  M  steps*  The  parameter  estimates 
to  be  obtained  at  the  end  of  the  current  E  and  M  steps  are  /3i ,  cr\  and  9\. 
We  also  use  o)  and 

The  conditional  distribution  of  the  random  effects  is  given  by 

%,Ab  crl  do  ~  M  (b,  (Z'Z  +  #Ao)-1^) 


(B-23) 


where  b  is  the  expected  value  of  6  and  is  given  by 

S  =  (Z'Z  +  9>M)-lZ\y  -  Xfo)  (B-24) 

When  evaluated  at  a  these  are  called  the  best  linear  unbiased  predictors 
(BLUPs)  of  b* 

3.1  The  E  step 

In  this  step  we  compute  Q(/3 ,  cr2?  0\y ,  ®o)»  the  conditional  expectation 

of  l  (/3?er2,@|y5&)  as  defined  in  equation  (B-22). 
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3.2  The  M  step 

We  use  the  following  ECME  scheme 

1.  Maximize  Q  over  6  by  fixing  a2  at  a2  and  (3  at  (3$  to  obtain 

2.  Maximize  £  over  f3  and  a 2  by  fixing  6  at  6\  to  obtain  (3\  and  a2. 


We  only  need  to  maximize  Q  with  respect  to  0,  which  is  equivalent  to 
maximizing  with  respect  to  all  the  First  we  must  compute 

the  partial  derivative  of  (B-25)  with  respect  to  <£4. 
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Equating  to  the  zero  matrix  and  substituting  a2  =  a2  =  a2  ml 
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We  can  use  the  decomposition  (B-18)  to  compute  the  update  more  effi¬ 
ciently  as  #J/2  =  s/nTi 


3,3  ECME  algorithm  for  REML 

Again  we  generalize  the  ECME  algorithm  in  DebRoy  and  Bates  (2003)  for 
REML  estimation.  As  explained  there,  we  extend  the  complete  data  set 
to  include  f3  as  if  it  were  another  level  of  random  effects  but  with  a  preci¬ 
sion  matrix  #o  =  0*  (In  a  Bayesian  formulation  this  would  correspond  to 
an  improper,  “locally  uniform5’  prior  distribution.)  We  note  that  the  com¬ 
plete  data  likelihood  £r  (a2, 0|y,/3,  b)  has  exactly  the  same  expression  as  in 
(B-22).  In  the  E  step  we  must  take  expectation  of  (B-25)  with  respect  to  (3 
by  regarding  f3\  as  {3  using  the  conditional  distribution  of  (3 

0\y,  (pML,  (X'SxlX)-1  <rg) 

where  Exo  is  defined  by  using  §a  =  3>ao  in  Ex*  This  gives  us  Qu  — 
Ep,b\y£R 

As  we  noted  in  DebRoy  and  Bates  (2003),  we  only  need  to  deal  with  the 
terms  which  involve  both  <&a  and  (3  in  (B-25)  for  maximizing  with  respect 
to  The  only  additional  term  in  the  derivative  is 

-  [*a(Z'Z  +  $ao)-1Z,X'ExoX’Z(Z,Z  +  **)-!] 

=  -(Z'Z  +  $ao)-1Z,X'ZxoX,Z(Z,Z  +  #A0)_1 

We  must  take  this  term  into  account  when  computing  the  EM  update. 
That  is  done  in  the  decompostion  (B-20)  so  that  we  can  calculate  the  update 
using  #*/2  =  y/rrii  (A^1)'. 
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Abstract 

The  nlme  package  for  fitting  and  examining  linear  and  nonlinear  mixed- 
effects  models  in  R  is  a  required  package  and  also  one  of  the  largest  R 
packages.  In  the  first  phase  of  a  project  to  extend  the  capabilities  of 
the  nlme  package  to  include  generalized  linear  mixed  models  (glmm’s), 
we  reimplemented  linear  mixed-effects  (Ime)  models  using  *34*  classes  and 
methods,  as  described  in  John  Chambers*  book  “Programming  with  Data5* 
and  as  implemented  in  the  methods  package  for  R.  Our  general  goals  for 
this  phase  are  to  incorporate  new  theoretical  and  computational  develop¬ 
ments  for  the  Ime  model  and  to  provide  a  faster,  cleaner  implementation 
of  Ime  fits  in  R  while  including  hooks  for  later  extensions  to  the  glmm 
model  and  the  nlme  model.  In  particular,  we  use  our  reStruct  (random- 
effects  structure)  class  in  iterative  PQL  fits  for  glmm*s,  based  on  Brian 
Ripley’s  function  glmmPQL  from  the  MASS  package. 

As  described  in  “Programming  with  Data”,  classes,  slots  and  inheri¬ 
tance  relationships  must  be  declared  explicitly  when  using  the  methods 
package.  Although  such  formal  declarations  require  package  authors  to  be 
more  disciplined  than  when  using  informal  ‘S3’  classes,  they  provide  assur¬ 
ance  that  each  object  in  a  class  has  the  required  slots  and  that  the  names 
and  classes  of  data  in  the  slots  are  consistent.  This  is  important  to  us  be¬ 
cause  we  are  trying  to  achieve  both  efficiency  and  flexibility.  We  provide 
flexibility  by  defining  many  classes  and  methods  and  by  using  multiple- 
argument  signatures  in  method  declarations.  We  achieve  efficiency  by 
implementing  many  methods  in  C  code  using  the  .Call  interface  and 
through  liberal  use  of  GET_SL0T  and  SET_SL0T  within  the  C  code. 

*This  work  is  supported  by  U.S.  Army  Medical  Research  and  Materiel  Command  under 
Contract  No.  DAMD17-02-C-0119.  The  views,  opinions  and/or  findings  contained  in  this 
report  are  those  of  the  authors  and  should  not  be  construed  as  an  official  Department  of  the 
Army  position,  policy  or  decision  unless  so  designated  by  other  documentation. 


We  feel  that  the  new  implementation  is  much  cleaner  and  easier  to 
understand  than  the  previous  implementation,  due  in  large  part  to  the 
more  extensive  use  of  classes  and  methods.  It  is  definitely  faster  and  can 
handle  larger  problems  than  the  previous  implementation. 


1  Introduction 

The  nlme  package  (Pinheiro  and  Bates,  2000)  for  fitting  and  examining  linear 
and  nonlinear  mixed-effects  models  is  a  large  R  package.  It  consists  of  more 
than  500  R  functions,  3500  lines  of  C  code,  and  40  data  sets  plus  documentation 
and  examples.  The  reason  that  there  are  so  many  functions  in  a  package  that 
is  devoted  to  fitting  just  one  general  type  of  statistical  model  is  to  provide 
flexibility  in  specifying  and  examining  the  models.  We  also  want  to  fit  mixed- 
effects  models  efficiently. 

To  organize  the  large  number  functions  applied  to  different  types  of  objects 
we  created  many  classes  of  objects  representing,  for  example,  grouped  data 
(groupedData),  linear  mixed-effects  model  structures  (ImeStruet),  nonlinear 
mixed-effects  model  structures,  random-effects  structures  (reStruct),  positive- 
definite  parameterized  matrices  (pdMat),  correlation  structures  (eorStruet),  vari¬ 
ance  functions  (varFune)  and  many  different  kinds  of  fitted  models  or  summaries 
or  plots  derived  from  fitted  models.  Most  of  the  500  functions  are  methods  for 
different  classes  of  objects. 

The  computational  methods  described  in  Pinheiro  and  Bates  (2000,  ch.  2,7) 
for  efficiently  evaiating  and  profiling  the  log-likelihood  or  log-restricted-likelihood 
of  a  linear  or  nonlinear  mixed-effects  model  with  multiple,  nested  levels  of  ran¬ 
dom  effects  are  quite  formidable.  Model  matrices  corresponding  to  the  fixed- 
effects  or  to  the  random-effects  terms  in  the  statistical  model  are  combined  and 
decomposed  repeatedly  during  the  iterative  optimization  of  the  objective  func¬ 
tion  to  determine  the  parameter  estimates.  To  achieve  a  reasonable  level  of 
efficiency  in  fitting  models  we  coded  the  compute-intensive  parts  of  the  calcula¬ 
tions  in  C. 

We  have  begun  a  project  to  extend  the  capabilities  of  nlme  to  fit  general¬ 
ized  linear  mixed  models  (Raudenbush  and  Bryk,  2002,  ch.  10),  beginning  with 
the  method  implemented  by  Brian  Ripley  in  the  glmmPQL  function  from  the 
MASS  package  but  also  implementing  estimation  methods  based  on  Laplacian 
and  adaptive  Gauss-Hermite  approximations  to  the  integral  of  the  conditional 
density  of  the  random  effects. 

In  the  first  phase  of  this  project  we  have  reimplemented  the  data  struc¬ 
tures  and  computational  algorithms  for  linear  mixed  models  as  ‘S4’  classes  and 
methods.  Our  objectives  for  this  reimplementation  are: 

*  To  encapsulate  the  underlying  structures  for  linear  mixed  models  in  such 
a  way  that  they  can  be  extended  to  generalized  linear  mixed  models  and 
to  nonlinear  mixed  models. 

*  To  incorporate  new  theoretical  and  computational  developments  for  the 


Ime  model.  We  have  derived  the  analytic  gradient  of  the  profiled  log- 
likelihood  (or  log-restricted  likelihood)  of  a  linear  mixed  model  (DebRoy 
and  Bates,  2003a, b)  and  have  related  the  gradient  results  to  an  ECME 
(expectation  conditional  maximization  either)  optimization  step.  The  an¬ 
alytic  gradient  allows  for  faster  and,  more  importantly,  more  stable  opti¬ 
mization. 

•  To  convert  the  numerical  linear  algebra  calls  from  Linpack  and  BLAS- 
1  calls  to  Lapack  (Anderson  et  al,5  1992)  and  BLAS  levels  1,  2,  and  3. 
Lapack  provides  state-of-the-art  algorithms  and  can  provide  a  substantial 
performance  boost  when  ATLAS  (Automatically  Tuned  Linear  Algebra 
Software)  implementations  of  the  BLAS  and  some  Lapack  routines  are 
available. 

•  To  switch  all  calls  of  C  code  to  the  .  Call  interface  so  that  entire  R  objects 
can  be  passed  to  and  from  the  C  code.  This  also  allows  direct  access  to 
the  slots  of  S4  classed  objects  from  within  C  code. 

•  To  monitor  the  number  of  copies  of  objects  that  are  created,  especially 
those  created  within  iterative  algorithms.  In  §5  we  discuss  an  example  of 
a  model  fit  to  375,000  observations  on  135,000  subjects  grouped  into  3722 
groups.  The  model  matrices  for  the  fixed  effects  can  have  as  many  as  40 
or  50  columns,  or  about  150  MB  for  each  copy  of  the  model  matrix  and 
information  derived  from  it.  We  need  at  least  three  arrays  of  this  size  to 
keep  track  of  all  the  information  we  use.  We  do  not  want  to  create  more 
than  that  if  we  can  avoid  doing  so. 

1.1  S3  versus  S4  classes  and  methods 

Object-oriented  programming  is  a  powerful  tool  for  organizing  the  representa¬ 
tion  of  information  (classes)  and  the  actions  that  are  applied  to  these  represen¬ 
tations  (methods),  A  system  of  classes  and  methods  for  the  S  language  was  in¬ 
troduced  in  Chambers  and  Hastie  (1992).  Wre  wall  call  this  the  SS3’  class  system, 
to  distinguish  it  from  the  4S4?  class  system  described  in  Chambers  (1998)  and 
implemented  for  R  in  the  methods  package.  Unlike  object-oriented  languages 
such  as  Java  and  C++  where  methods  are  associated  with  a  class  definition, 
both  the  33  and  the  34  systems  associate  methods  with  the  combination  of  a 
generic  function  and  the  classes  of  one  or  more  of  the  arguments  to  that  function. 

S3  classes  are  informal:  the  class  of  an  object  is  determined  by  its  class 
attribute,  which  should  consist  of  one  or  more  character  strings,  and  methods 
are  found  by  combining  the  name  of  the  generic  function  with  the  class  of  the 
first  argument  to  the  function.  If  a  function  having  this  combined  name  is 
on  the  search  path,  it  is  assumed  to  be  the  appropriate  method.  Classes  and 
their  contents  are  not  formally  defined  in  the  S3  system  -  at  best  there  is  a 
“gentleman  ?s  agreement”  that  objects  in  a  class  will  have  certain  structure  with 
certain  component  names. 


By  contrast,  34  classes  must  be  defined  explicitly.  The  number  of  slots  in 
objects  of  the  class,  and  the  names  and  classes  of  the  slots,  are  established  at 
the  time  of  class  definition.  During  computation  with  objects  from  the  class 
they  can  be  validated  against  the  definition.  As  in  many  other  object-oriented 
systems,  an  S4  class  can  be  declared  to  inherit  from  another  class  so  S4  classes 
can  be  arranged  in  a  hierarchy, 

S4  also  requires  formal  declarations  of  methods,  unlike  the  informal  system 
of  using  function  names  to  identify  a  method  in  S3.  An  S4  method  is  declared 
by  a  call  to  setMethod  giving  the  name  of  the  generic  and  the  “signature”  of  the 
arguments.  The  signature  identifies  the  classes  or  one  or  more  named  arguments 
to  the  generic  function.  Special  meta-classes  named  ANY  and  missing  can  be 
used  in  the  signature, 

S4  generic  functions  can  be  declared  by  a  call  to  setGeneric  or  they  can  be 
automatically  created  by  declaring  a  method  for  an  existing  function,  in  which 
case  the  function  becomes  generic  and  the  current  definition  becomes  the  default 
method. 


2  Package  conversion:  creating  S4  classes 

The  principle  generic  functions  for  mixed-effects  models  and  the  methods  as¬ 
sociated  with  them  were  already  defined  in  version  3.1  of  the  nlme  package. 
Although  these  generics  and  methods  would  be  modified  to  some  extent  during 
the  conversion  to  S4  classes  and  methods,  we  could  initially  assume  these  would 
stay  as  they  are  and  concentrate  instead  on  determining  what  classes  should  be 
defined  and  how  we  should  define  them. 

We  could  use  the  informal  set  of  classes  from  the  S3  version  as  a  guide  when 
formulating  the  34  classes.  We  found,  however,  that  we  frequently  reconsidered 
the  structure  of  the  classes  dining  the  conversion  and  usually  ended  up  adding 
more  slots  to  the  classes  than  had  been  present  in  the  informal,  SS3?  version. 

Consider,  for  example,  the  pdMat  class  of  parameterized,  positive  definite, 
symmetric  matrices.  It  is  a  virtual  class  (which  means  that  there  will  be  objects 
of  classes  that  inherit  from  pdMat  but  there  will  never  be  objects  whose  only 
class  is  pdMat).  The  matrices  represented  by  objects  that  inherit  from  this  class 
are  determined  by  a  non-redundant,  unconstrained  vector  of  parameters.  In 
some  parameterizations  the  dimensions  of  the  matrix  can  be  determined  from 
the  length  of  the  parameter  vector  but  in  others,  such  as  pdldent,  representing 
multiples  of  the  identity,  or  pdCompSymm,  representing  matrices  with  compound 
symmetry,  the  parameter  vector  has  a  fixed  length  and  the  number  of  columns 
in  the  matrix  must  be  stored  separately.  We  decided  to  add  an  Ncol  slot  to  all 
the  pdMat  classes  and  did  so  by  declaring  it  in  the  virtual  pdMat  class.  In  fact, 
the  pdMat  class  declares  six  slots  that  must  be  present  in  all  classes  inheriting 
from  it, 

setClass ("pdMat”,  #  parameterized  positive-definite  matrices 

representation(form-ttformula*t,  #  a  node! -matrix  formula 
Names=H character" t  #  column  (and  row)  names 


param= "numeric",  #  parameter  vector 

Ncol=" integer" ,  #  number  of  columns 

factor="matrixn ,  #  factor  of  the  pos-def  matrix 

logDet= "numeric "  #  logarithm  of  the  absolute  value 

##  of  the  determinant  of  the  factor  (i.e,  half 
##  the  logarithm  of  the  determinant  of  the  matrix) 

), 

prototype (form= formula (NULL) , 

Names-character (0) , 
param=numeric(0) , 

Ncol=as ,  integer  (0) , 
factor=matrix  (numeric  (0)  ,0,0), 
logDet^numeri c (0) ) 

) 

After  this  definition  most  of  the  class  definitions  for  other  pdMat  classes  are 
trivial. 

setClass("pdSymm" ,  pdMat 11 )  #  general  symmetric  pd  matrices 

setClass ("pdScalar" ,  "pdSymm”)  #  special  case  of  positive  scalars 
setClass(updLogChol" ,  UpdSymm”)  #  default  parameterization 
setClass ("pdNatural",  UpdSymm”)  #  log  sd  and  Fisher *s  z  of  correlation 
setClass ("pdMat rixLog ",  " pdSymm H)  #  matrix  logarithm  parameterization 
setClass( "pdDiag" ,  "pdMat")  #  diagonal  pd  matrices 

setClass("pdIdent ”,  “pdMat")  #  positive  multiple  of  the  identity 

setClass("pdCompSymm" ,  "pdMat")  #  compound  symmetric  pd  matrices 

Increasing  the  number  of  slots  may  be  an  inevitable  consequence  of  revis¬ 
ing  the  package  (we  tend  to  add  capabilities  more  frequently  than  we  remove 
them)  but  it  may  also  be  related  to  the  fact  that  S4  classes  must  be  declared 
explicity  and  hence  we  consider  the  components  or  slots  of  the  classes  and  the 
relationships  between  the  classes  more  carefully. 

3  Calling  C  functions  with  .Call 

The  .Call  interface,  through  -which  a  programmer  can  pass  mw  R  objects  to 
C  code  and  receive  raw  R  objects  from  the  C  code,  has  been  part  of  R  for 
several  years.  It  was  inspired  by  the  .  Call  interface  for  S  described  in  Cham¬ 
bers  (1998).  Several  C  macros  for  working  with  S4  classed  objects,  including 
GET_SLOT,  SET.SL0T,  MAKE_CLASS  and  NEW,  all  described  in  Chambers  (1998)  are 
are  now  available  in  R,  or  will  be  in  R-1,7.0.  (Note  that  the  macro  NEW_ OBJECT 
is  preferred  to  NEW  when  writing  code  for  R  only.  These  two  macros  have  the 
same  effect  but  NEW_ OBJECT  is  less  likely  to  conflict  with  other  definitions.) 

The  combination  of  the  formal  classes  of  S4,  the  .  Call  interface,  and  these 
macros  allows  a  programmer  to  manipulate  34  classed  objects  in  C  code  nearly 
as  easily  as  in  R  code.  A  common  idiom  is  to  have  an  S4  method  call  C  code 


through  the  .Call  interface.  In  the  C  code  the  values  of  slots  are  extracted 
with  GET_SL0T  and  either  modified  in  place  or  used  to  create  slots  for  newr 
objects.  Such  new  objects  are  created  and  populated  by  calls  to  MAKE_CLASS, 
NEW.OBJECT,  and  SET_SL0T. 

Because  the  C  code  is  called  from  a  method,  the  programmer  can  be  confident 
of  the  classes  of  the  objects  passed  in  the  call  and  the  classes  of  the  slots  of  those 
objects.  Much  of  the  checking  of  classes  or  modes  and  possible  coersion  of  modes 
that  is  common  in  C  code  called  from  E  can  be  bypassed. 

We  found  that  we  would  initially  write  methods  in  R  then  translate  them 
into  C  if  warranted.  The  nature  of  our  calculations,  frequently  involving  multi¬ 
ple  decompositions  and  manipulations  of  sections  of  arrays,  was  such  that  the 
calculations  could  be  expressed  in  R  but  not  very  cleanly.  Once  we  had  the 
R  version  working  satisfactorily  we  could  translate  into  C  the  parts  that  were 
critical  for  performance  or  were  awkward  to  write  in  R.  An  important  advantage 
of  this  mode  of  development  is  that  we  could  use  the  same  slots  in  the  C  version 
as  in  the  R  version  and  create  the  same  types  of  objects  to  be  returned. 

We  feel  that  defining  S4  classes  and  methods  in  R  then  translating  parts  of 
method  definitions  to  C  functions  called  through  .  Call  is  an  extremely  effective 
mode  for  numerical  computation.  Programmers  who  have  experience  working  in 
C++  or  Java  may  initially  find  it  more  convenient  to  define  classes  and  methods 
in  the  compiled  language  and  perhaps  define  a  parallel  series  of  classes  in  R.  (We 
did  exactly  that  when  creating  the  Matrix  package  for  R.)  We  encourage  such 
programmers  to  try  instead  this  method  of  defining  only  one  set  of  classes,  the 
S4  classes  in  R,  and  use  these  classes  in  both  the  interpreted  language  and  the 
compiled  language. 

3,1  Using  replacement  methods 

The  .Call  interface  is  a  powerful  tool  and,  like  many  powerful  took,  must  be 
used  earefully  if  you  are  to  avoid  hurting  yourself  with  it.  The  programmer 
must  be  aware  that  the  arguments  are  not  copied  when  they  are  passed  through 
.Call  even  though  the  semantics  of  R  function  calls  require  that  arguments 
must  not  be  modified  by  a  function  call.  If  you  are  to  modify  the  value  of  an 
R  object  passed  as  an  argument  you  must  somehow  copy  its  storage,  usually 
by  duplicating  it  or  coercing  it  to  another  mode,  before  making  any  changes. 
Failure  to  do  so  can  result  in  bugs  that  are  extremely  difficult  to  diagnose. 

Duplicating  or  coercing  R  objects  will  usually  require  that  the  result  be 
protected  from  the  garbage  collector  by  a  call  to  the  PROTECT  macro.  The  effect 
of  all  calls  to  PROTECT  must  be  undone  by  calling  UNPROTECT  before  returning 
from  the  C  function.  Keeping  track  of  what  has  been  protected  can  sometimes 
be  tedious. 

There  is  a  one  exception  to  the  “don’t  modify  the  arguments”  rule:  a  replace¬ 
ment  function  or  a  replacement  method  is  allowed  to  modify  its  first  argument. 
Because  the  class  of  the  result  is  the  same  as  the  class  of  the  first  argument  it 
is  common  to  use  this  argument  as  the  return  value,  after  suitable  modification 
of  its  contents.  We  found  that  we  used  replacement  methods  more  frequently 


in  our  code  than  we  had  first  expected.  We  tended  to  think  in  steps  of  creating 
an  object,  then  modifying  it  according  to  the  values  of  other  objects. 

For  example,  a  linear  mixed-effects  model  is  represented  by  an  reStruct 
object.  The  core  part  of  the  code  to  fit  such  a  model  is 

re  <-  reStruct  (fixed  -  fixed,  random  *  random , 

data  =  eval(mCall,  parent  .frame  ()) , 

REML  =  method  !=  HML H) 

EMsteps(re)  <-  controlvals 
LMEoptimize(re)  <-  controlvals 

where  we  construct  the  reStruct  object,  perform  some  number  of  EM  updates 
on  it  then  perform  general  nonlinear  optimization  on  it. 

4  Use  of  Lapack  and  ATLAS 

Unpack  and  Eispack  routines  are  used  for  numerical  linear  algebra  in  E  since 
its  inception  and  are  part  of  the  R  API.  The  Linpack  and  Eispack  packages 
have  been  largely  supercede  by  Lapack  (Anderson  et  ah,  1992)  which  provides, 
in  some  cases,  better  algorithms  and,  in  nearly  all  cases,  more  effective  imple¬ 
mentations  of  the  algorithms.  Some  of  the  effectiveness  of  the  implementations, 
especially  for  large  arrays,  comes  from  more  extensive  use  of  the  Basic  Linear 
Algebra  Subroutines  (BLAS).  As  the  name  implies,  these  are  basic  routines  for 
doing  operations  like  multiplying  two  matrices  or  replacing  y  by  ax  -f-  y.  Even 
in  sophisticated  linear  algebra  algorithms,  the  majority  of  the  numerical  com¬ 
putation  takes  place  in  these  basic  operations,  hence  it  is  worthwhile  devoting 
considerable  effort  to  optimizing  these  routines.  ATLAS  (Automatically  Timed 
Linear  Algebra  Software)  is  a  collection  of  highly  optimized  BLAS  routines  that 
can  be  compiled  for  different  architectures.  The  combination  of  Lapack  and  AT¬ 
LAS  can  give  a  considerable  performance  boost  to  algorithms  that  use  numerical 
linear  algebra  extensively. 

The  R  Core  Development  Group  (primarily  Brian  Ripley)  has  been  migrating 
R  from  Linpack  and  Eispack  to  Lapack.  Beginning  with  R-l.7.0  the  double 
precision  Lapack  routines  and  some  of  the  double  precision  complex  Lapack 
routines  will  be  part  of  the  R  API.  We  converted  all  the  linear  algebra  in  the 
lme  calculations  to  Lapack,  with  gratifying  results  as  described  in  the  next 
section. 


5  Timing  results 

Rodriguez  and  Goldman  (1995)  simulated  100  sets  of  2445  binary  responses 
grouped  into  1558  families  in  161  communities  and  fit  generalized  linear  mixed 
models  with  two  levels  of  random  effects  to  these.  The  implementation  of  glmm- 
PQL  in  the  new  version  of  the  package  is  roughly  5  times  as  fast  on  these  fits  as 
the  previous  version  that  used  repeated  calls  to  lme. 


We  also  lit  a  linear  mixed-effects  model  to  378047  mathematics  scores  of 
134713  students  on  the  Texas  Assessment  of  Academic  Skills  (TA AS) »  The  data 
were  all  the  test  scores  of  students  in  grades  3  to  8  in  Dallas,  Texas  during  1994 
to  2000.  The  particular  model  that  we  fit  had  a  fixed-effects  vector  of  length 
47,  resulting  in  very  large  model  matrices.  A  fit  with  the  previous  version  of 
the  Ime  function  took  993  seconds  of  user  time  (1093  seconds  elapsed  time)  on  a 
2.0  GHz  Pentium-4  machine  with  1,0  GB  of  PC-2700  memory  running  Debian 
GNU/ Linux.  The  new  version  of  Ime  took  221  seconds  user  time  (345  seconds 
elapsed  time)  on  the  same  machine. 

6  Conclusions 

We  feel  that  we  have  met  our  objectives  in  reimplementing  the  Ime  part  of 
the  nlme  package  using  £S4*  classes  and  methods.  Although  the  performance 
boost  from  using  Lapack  and  ATLAS  is  gratifying,  we  feel  that  the  biggest  gain 
is  in  making  the  code  much  cleaner  and  easier  to  understand  and  in  exposing 
interfaces  that  can  be  used  by  models  that  extend  the  linear  mixed-effects  model. 

Code  clarity  is  enhanced  by  the  fact  that  S4  classes  and  the  .Call  interface 
allow  programmers  to  work  with  the  same  class  definitions  in  R  code  and  in  C 
code.  We  have  also  found  that  liberal  use  of  replacement  functions  and  methods 
allows  us  to  maintain  control  of  the  number  of  copies  of  objects  being  created. 
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